The final project counts as 30% of the student's final grade in the course. This section contains a description of the final project and suggests project ideas.
Important Dates
Key dates for the project, as listed in the calendar, are:
A large portion (30%) of your grade in 6.830 consists of a final project. This project is meant to be a substantial independent research or engineering effort related to material we have studied in class. Your project may involve a comparison of systems we have read about, an application of database techniques to a system you are familiar with, or be a database-related project in your research area.
This document describes what is expected of a final project and proposes some possible project ideas.
What is Expected
Good class projects can vary dramatically in complexity, scope, and topic. The only requirement is that they be related to something we have studied in this class and that they contain some element of research - e.g., that you do more than simply engineer a piece of software that someone else has described or architected. To help you determine if your idea is of reasonable scope, we will arrange to meet with each group several times throughout the semester.
What to Hand In
There are two written deliverables: a project proposal, due at lecture session L10, and a final report, due at project session P1.
Project Proposal
The proposal should consist of 1-2 pages describing the problem you plan to solve, outlining how you plan to solve it, and describing what you will "deliver" for the final project. We will arrange short meetings with every group before the project proposal to help you refine your topic and would be happy to provide feedback on a draft of your proposal before it is due.
Final Report
You should prepare a conference-style report on your project with maximum length of 15 pages (10 pt font or larger, one or two columns, 1" margins, single or double spaced - more is not better.) Your report should introduce and motivate the problem your project addresses, describe related work in the area, discuss the elements of your solution, and present results that measure the behavior, performance, or functionality of your system (with comparisons to other related systems as appropriate.)
Because this report is the primary deliverable upon which you will be graded, do not treat it as an afterthought. Plan to leave at least a week to do the writing, and make sure you proofread and edit carefully!
Please submit a paper copy of your report at project session P1. You will also be expected to give a presentation on your project in class that will provide an opportunity for you to present a short demo of your work and show what you have done to other students in the class. Details about the format of the presentation will be posted as the date gets closer.
Project Ideas
The following is a list of possible project ideas; you are not required to choose from this list - in fact, we encourage you to try to solve a problem of your own choosing!
Compression in databases. Investigate the impact of keeping data compressed in SimpleDB or another open source database; how much time is saved reading by compressed data off disk? What is the cost of decompression? Is it possible to use compression schemes (e.g., run length encoding) that allow some operations (e.g., selections) to be applied without decompressing? What is the benefit of doing that?
Auto-admin tools to recommend indices, etc. Design a tool that recommends a set of indices to build given a particular workload and a set of statistics in a database. Alternatively investigate the question of which materialized views to create in a data-warehousing system, such as C-Store.
TinyDB Projects: TinyDB is a database system that runs on networks of tiny, wireless sensing devices (e.g.,
Crossbow Motes) and allows users to collect information from the via declarative queries. There are many possible kinds of extensions to TinyDB; some ideas:
Design a simple programming environment so that users and programmers can extend TinyDB with new functions, just like users can write user defined functions in database systems.
Implement a set of operators or extend the TinyDB query language with support for some kind of signal processing or time-series analysis algorithm.
TinyDB currently floods queries to all nodes in the network, which is not a good idea if queries only apply to a few nodes. The goal of this project would be to add support for scoped queries that go only to a subset of nodes and measure the benefit of your implementation.
Make TinyDB work in the face of disconnections in the sensor network topology.
There is a simulator and hardware available for students interested in one of these projects.
Data warehousing systems (such as C-Store are often optimized to deal with so-called "star schemas" that consist of one very large "fact" table that joins with a number of smaller "dimension" tables through foreign key relationships. For example, a data warehouse might consist of a fact table that stores information about recent sales, which would join with tables about customers, stores, etc. In this project, you would devise a scheme to efficiently evaluate queries over schemas that are "near" star schemas by transforming them to star schemas and using the C-Store query optimizer.
The Selinger Optimizer (see "Access Path Selection in a Relational Database Management System" in the Red Book) is notoriously bad at making query optimization decisions when there are many joins because the statistics it computes over intermediate results are often wrong. Your goal for this project is to devise a scheme for evaluating joins that doesn't require such detailed statistics - for example, by partitioning the query plan into several smaller groups of joins that can be computed independently, and then using statistics collected during the computation of those joins to join the remaining intermediate relations. Of course, this won't be "optimal", but your goal should be to show that your scheme avoids making very bad decisions and can in-fact outperform Selinger in some situations.
Modern networking hardware often provides support for efficient broadcast and multicast of packets from a sender to multiple receivers. Suppose you had access to such primitives; your goal in this project would be to explore how to use them to make distributed query processing more efficient. We have access to a small cluster of machines that you can use to explore any techniques you develop in the context of SimpleDB.
SimpleDB is very simple. There are a number of ways you might extend it to explore some of the research ideas we have studied in this class. For example, you could add support for optimistic concurrency control and compare its performance to the basic concurrency control scheme you will implement in Problem Set 3. There are a number of other possible projects of this type; we would be happy to discuss these in more detail.
In the CarTel project, we are building a system for collecting and managing data from automobiles. One of the goals of this project is to use data collected from GPS devices on the cars to make intelligent route-planning decisions (think MapQuest or Google Maps with access to current traffic information.) To build such a smart route planner, we need a way to map GPS points (latitute, longitude) to named roads that correspond to those points. There has been some recent work in the database community on doing this mapping very efficiently; the goal of this project would be to investigate these algorithms in the context of CarTel and explore other possibly more effective techniques.
Personal Information Management: Build a utility for collecting and querying structured and unstructured data about a user's data, files, and general computing environment - for example, this tool might manage lists of URLs, automatically extracting keywords about those URLs from the associated web sites and storing them in a database so that users can search for their favorite web pages without remembering URLs. Or build a system that stores URLs and keywords with downloaded files to facilitate file search. Or build a tool that keeps a database of a users preferences files (or registry entries) and allows users to "rollback" inadvertent and / or nasty changes.
CiteSeer: The PDOS research group at MIT has access to the data for the CiteSeer system, which is currently not stored in a relational database. An interesting project would be to port the system to a database engine, and extend the functionality with some new and interesting queries enabled by SQL, or to compare the performance before and after the porting.
Database Performance in Haystack: Haystack is a "universal information client" useful for storing all sorts of personal data - email, contacts, calendars, etc. It stores data in RDF (an XML-like semi-structured representation) and uses a custom in-memory database system. The Haystack team would like to be able to replace this system with an off-the-shelf database engine, but there are some significant performance issues in doing so. An interesting project would be to analyze the cause of these performance issues, devise a set of recommendations for how to improve database performance, and implement this in Haystack.
Exploiting Structured Data in Haystack: One of the challenges of building a database system for unstructured data (e.g., data where there is no fixed schema) is that formulating queries over such data is very hard (because users don't know which fields to query). One interesting question is to try to build a set of tools that encourage users to insert data that has a similar schema as records already in the system. A similar challenge is to devise a programming environment that makes it easy for developers to query data that has no definite schema (but where something about the schema is known.)
Integrity Constraints in Streaming Systems: Current stream data processing systems do not provide any way for users to specify constraints on the data that they can handle - for example, a financial analysis application might want to verify that no purchases exceed some available balance. This project would involve defining what such constraints mean, and exploring ways to efficiently check and react to them.