Welcome to the WPI Data Science Research Group. We are a group of faculty, researchers, and students working on database projects. Our group focuses on research issues and project work related to very large database and information systems supporting advanced applications in business, engineering, and the sciences, including large-scale data analytics, scientific data management, annotation and provenance management, and multi-dimensional query processing and optimization.
Currently ongoing projects include intelligent event analytics, scalable data stream processing systems, map-reduce technologies, biological databases, stream mining and discovery, large-scale visual information exploration, medical process tracking, and distributed heterogeneous information sources, to name just a few. We strive to build software systems that evaluate the feasibility of our innovations and demonstrate their usefulness on real problems.
Data Integration
|
|
MATTERS: Massachusetts Technology, Talent, and Economic Reporting System
Read more...
The Massachusetts Technology, Talent, and Economic Reporting System (MATTERS) is
an online analytics dashboard powered by a dynamic data integration infrastructure. By extracting data sets
from various public government data sites, the system allows users to quickly access, analyze, and visualize
a number of key factors impacting the economic competitiveness of US states. This project is a collaboration between
the Massachusetts High Technology Council (MHTC) and Worcester Polytechnic Institute. Under the supervision of
Professor Elke Rundensteiner, students at WPI have worked with experts from the high-tech industry, research organizations,
and higher education institutions to develop this tool.
|
Complex event stream processing
|
|
CEA: Complex Event Analytics
Read more...
Recent advances in hardware and software have enabled the capture of diverse measurements
in a wide range of fields. Applications that generate rapid, continuous, and large volumes of event streams include
sensor readings from physics, biology, and chemistry experiments, weather sensors, health sensors,
network sensors, online auctions, credit card operations, financial tickers, web server log records, and more.
Given these developments, the world is poised for a sea change in the variety, scale, and importance of applications
that can be envisioned based on the real-time analysis and exploitation of such event streams for decision making,
from dynamic traffic management and environmental monitoring to health care.
Clearly, the ability to infer relevant patterns from these event streams in real time to make near-instantaneous yet informed decisions,
henceforth called complex event analytics, is crucial for these mission-critical applications.
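For a concrete flavor of the pattern inference involved, the sketch below detects a two-event sequence within a time window over a stream. The event encoding and function names are invented for illustration; they are not our system's actual operators.

```python
from collections import deque

def detect_sequence(stream, first_type, second_type, window):
    """Report (a, b) pairs where an event of first_type is followed by
    an event of second_type within `window` time units."""
    pending = deque()  # recent first_type events, ordered by timestamp
    for event in stream:                 # event = (timestamp, type, payload)
        ts, etype, _ = event
        # expire first_type events that have fallen out of the window
        while pending and ts - pending[0][0] > window:
            pending.popleft()
        if etype == second_type:
            for a in pending:
                yield (a, event)
        if etype == first_type:
            pending.append(event)

events = [(1, "A", None), (2, "B", None), (10, "B", None)]
matches = list(detect_sequence(events, "A", "B", window=5))
# one match: A at time 1 followed by B at time 2; B at time 10 is outside the window
```

Because expired events are dropped as the stream advances, state stays bounded by the window size, which is what makes this kind of analytics feasible over unbounded streams.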
|
|
HIT: Hierarchical Instantiating Timed automaton
Read more...
Real-time reactive applications, from supply chain tracking to health care data analytics,
have gained increasing importance and complexity.
To facilitate the specification of involved event-based application semantics,
we introduce a novel model, HIT, that finds a middle ground between
a specification composed of a large set of low-level queries and
a high-level graphical workflow description.
The workflow is captured by the Hierarchical Instantiating Timed automaton (HIT),
while succinct queries are formulated within its states,
which provide valuable context for launching query execution.
HIT models an arbitrary number of event-driven
sequential or concurrent hierarchical processes, as
required for realizing complex real-world applications,
using a succinct yet expressive specification.
The effectiveness of HIT is illustrated by a full case study of
an auction scenario.
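The core idea, states that both drive transitions and provide query context, can be sketched in a few lines. This toy automaton is an illustration of the concept only, not the HIT specification language; all names are invented.

```python
class State:
    def __init__(self, name, query=None):
        self.name = name
        self.query = query        # callable run on events while this state is active
        self.transitions = {}     # event type -> next State

    def on(self, event_type, target):
        self.transitions[event_type] = target
        return self

class Automaton:
    def __init__(self, start):
        self.current = start

    def feed(self, event):        # event = (type, payload)
        if self.current.query:
            self.current.query(event)   # the active state supplies query context
        nxt = self.current.transitions.get(event[0])
        if nxt:
            self.current = nxt

# Simplified auction flavor of the case study: bids are aggregated
# only while the "opened" state is active.
bids = []
opened = State("opened",
               query=lambda ev: bids.append(ev[1]) if ev[0] == "bid" else None)
closed = State("closed")
opened.on("close", closed)

auction = Automaton(opened)
for ev in [("bid", 10), ("bid", 15), ("close", None), ("bid", 99)]:
    auction.feed(ev)
# bids == [10, 15]: the late bid arrives after the transition to "closed"
```

The point of the middle ground: the workflow (open, then close) lives in the automaton structure, while each query stays small because its state already scopes it.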
|
Annotations in relational databases
|
|
PrefNotes: A Framework for Personalized Annotation Management in Relational Databases
Read more...
Annotations play a key role in understanding and describing data,
and annotation management has become an integral component of most emerging
applications such as scientific databases. Scientists need to exchange not
only data but also their thoughts, comments, and annotations on that data.
Annotations represent comments, the lineage of data, descriptions, and much more.
Therefore, several annotation management techniques have been proposed to
handle annotations efficiently and abstractly. However, with the
increasing scale of collaboration and the extensive
use of annotations among users and scientists, the number and size of the
annotations may far exceed the size of the original data itself.
Among the many existing annotations, different users may have
different preferences, and only a small number of annotations may be of interest
to each user based on those preferences. Current annotation management
techniques report all annotations to users without taking their preferences into account.
We propose PrefNotes, a framework for personalized annotation
propagation in relational databases. PrefNotes captures users' preferences and
profiles and personalizes annotation propagation at query time by reporting
the top-k most relevant annotations (per tuple) for each user. PrefNotes supports
static and dynamic profiles for each user. We propose three variants of top-k operators,
namely fixed, proportional, and approximate proportional operators, that differ in their
cost model and accuracy. PrefNotes is implemented inside PostgreSQL.
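The sketch below illustrates per-tuple top-k reporting in the spirit of the fixed operator: each tuple returns at most k annotations ranked by a profile-based relevance score. The keyword-overlap score and profile format are simplifying assumptions, not PrefNotes' actual model.

```python
import heapq

def relevance(annotation, profile):
    # toy score: how many profile keywords appear in the annotation text
    return sum(word in annotation.lower() for word in profile)

def top_k_annotations(tuples, k, profile):
    """tuples: dict mapping a tuple id to its list of annotation strings."""
    return {
        tid: heapq.nlargest(k, notes, key=lambda a: relevance(a, profile))
        for tid, notes in tuples.items()
    }

data = {
    "gene42": ["verified lineage from curated source",
               "possible transcription error",
               "lineage traced to 2009 import"],
}
result = top_k_annotations(data, k=2, profile={"lineage", "curated"})
# the two lineage-related annotations outrank the error comment
```

A dynamic profile would simply re-derive the keyword set (and hence the ranking) at query time, which is why personalization can happen during propagation rather than in advance.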
|
|
InsightNotes: Supporting Annotations Beyond Propagation In Relational Databases
Read more...
Scientific database systems provide backbone support for various scientific applications.
In these applications, an efficient and effective annotation management mechanism is vital
for sharing knowledge and establishing a collaborative environment among end-users and scientists.
Annotations may represent comments on the data, provenance or lineage information, and highlights
of conflicting or erroneous values.
The extensive use of annotations and the expanding scale of collaboration may cause the size of
annotations to far exceed the size of the original data, making it
extremely difficult for end-users to extract useful insights and the valuable knowledge hidden within the annotations.
In this project, we propose the InsightNotes system, an advanced annotation management
system over relational databases, for exploiting annotations in novel ways through summarization, mining, and ranking
techniques, with the objective of reporting concise and meaningful representations instead of the raw annotations.
InsightNotes also addresses the query processing challenges involved in building and querying such complex representations.
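As a minimal sketch of the summarization idea, the snippet below replaces a tuple's raw annotations with category counts plus one representative per category. The two-category classifier is an invented stand-in, not InsightNotes' actual mining machinery.

```python
from collections import Counter

def summarize(annotations, classify):
    """Report a concise summary instead of raw annotations:
    a count per category plus one representative annotation each."""
    counts = Counter()
    example = {}
    for note in annotations:
        cat = classify(note)
        counts[cat] += 1
        example.setdefault(cat, note)   # keep the first annotation seen per category
    return {cat: (n, example[cat]) for cat, n in counts.items()}

# toy classifier (an assumption): provenance notes vs. data-quality comments
classify = lambda n: "provenance" if "source" in n else "quality"
summary = summarize(
    ["imported from NCBI source", "value looks wrong", "suspect units"],
    classify,
)
# {'provenance': (1, 'imported from NCBI source'), 'quality': (2, 'value looks wrong')}
```

Even this toy version shows the payoff: the reader sees "2 quality concerns" at a glance instead of scanning every raw note.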
|
Scalable Data Mining Technologies
|
|
A Framework for Analyzing Text Data Streams in Social Microblogging Networks
Read more...
An enormous amount of data exists at massive scale, either static or in the form of data streams.
These massive data sets contain interesting and useful information. For instance, social micro-blogging
sites such as Twitter carry large volumes of messages, some of which contain valuable
information about a wide variety of real-world events. Analyzing such a data stream presents
significant opportunities as well as challenges. This research project focuses on online identification
of emerging trends and topics of discussion and explores the evolution of these topics over time.
Identifying trending topics in real time on Twitter is a
challenging problem due to the fast evolution and large scale of
the unstructured data. To tackle these challenges, the system should provide these features:
(1) The system should be able to process data in a single pass.
(2) The mining algorithm must be fast and scalable enough to handle massive data in real time.
(3) The mining method needs to execute incrementally, in an online fashion.
(4) It should also be able to handle outliers and the evolution of data.
The continuous evolution of data makes it essential to quickly identify new trends in the data.
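One simple way to satisfy requirements (1)-(4) is exponentially decayed term counting, sketched below: each term's score is updated in a single pass, old topics fade as their counts decay, and isolated outliers never accumulate enough weight to trend. This is an illustrative technique, not necessarily the project's actual algorithm.

```python
import math

class TrendTracker:
    def __init__(self, half_life):
        self.decay = math.log(2) / half_life   # score halves every half_life units
        self.scores = {}                        # term -> decayed count
        self.last_seen = {}                     # term -> timestamp of last update

    def observe(self, term, ts):
        # lazily decay the old score to "now", then add the new occurrence
        old = self.scores.get(term, 0.0)
        dt = ts - self.last_seen.get(term, ts)
        self.scores[term] = old * math.exp(-self.decay * dt) + 1.0
        self.last_seen[term] = ts

    def trending(self, n):
        return sorted(self.scores, key=self.scores.get, reverse=True)[:n]

t = TrendTracker(half_life=60.0)
for ts, term in [(0, "quake"), (1, "quake"), (2, "game"), (3, "quake")]:
    t.observe(term, ts)
# "quake" dominates: three recent mentions versus one
```

Note the decay is applied lazily, only when a term recurs; a production version would also decay scores at query time and evict terms whose scores fall below a floor to bound memory.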
|
|
XMDV: Visual Exploration Support for Data Mining and Discovery
Read more...
XmdvTool is a public-domain software package for the interactive visual exploration of multivariate data sets.
It is available on all major platforms, including UNIX, Linux, macOS, and Windows. XmdvTool is developed using Qt and Eclipse CDT.
It supports five methods for displaying flat form data and hierarchically clustered data:
(1) Scatterplots, (2) Star Glyphs, (3) Parallel Coordinates, (4) Dimensional Stacking, and (5) Pixel-oriented Display.
XmdvTool also supports a variety of interaction modes and tools, including brushing in screen, data, and structure spaces, zooming,
panning, and distortion techniques, and the masking and reordering of dimensions. Univariate display and graphical summarization,
via tree-maps and modified Tukey box plots, are also supported. Finally, color themes and user-customizable color assignments
permit tailoring of the aesthetics to the user's preferences.
XmdvTool has been applied to a wide range of application areas, some of which are highlighted in our Case Studies.
Some of these domains include remote sensing, financial, geochemical, census, and simulation data. We are always looking
for new applications, so if you've had some success with the system in your domain, we'd love to hear from you.
See our contact page and join our user group if you'd like to contribute something or get further information.
|
|
Distributed Scalable Outliers in Big Data
Read more...
Distance-based outlier detection is a popular and fundamental task in data analysis.
However, its potentially quadratic time complexity impedes its usefulness on large-scale data.
We address this issue and propose a distributed algorithm that scales to large data.
We also embrace MapReduce, a popular shared-nothing distributed platform.
This work outperforms existing solutions by:
(1) reducing the replica transportation cost (between different physical machines);
(2) guaranteeing load balancing under data skew.
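For reference, the standard distance-based definition underlying this work: a point is an outlier if fewer than k other points lie within radius r of it. The naive version below is the quadratic baseline that the distributed algorithm is designed to avoid (helper names are ours).

```python
def outliers(points, r, k):
    """Naive O(n^2) distance-based outlier detection in Euclidean space."""
    result = []
    for p in points:
        # count neighbors of p within radius r (squared distances avoid sqrt)
        neighbors = sum(
            1 for q in points
            if q is not p and sum((a - b) ** 2 for a, b in zip(p, q)) <= r * r
        )
        if neighbors < k:
            result.append(p)
    return result

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(outliers(pts, r=1.0, k=2))  # [(5.0, 5.0)]: it sits far from the cluster
```

In a MapReduce setting, points near partition boundaries must be replicated to neighboring partitions so their neighbor counts stay correct, which is exactly the replica transportation cost item (1) targets.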
|
|
PARAS: A Parameter Space Framework for Online Association Mining
Read more...
Association rule mining is known to be computationally intensive,
yet real-time decision-making applications are increasingly intolerant
of delays. In this project, we introduce the parameter space
model, called PARAS. PARAS enables efficient rule mining by
compactly maintaining the final rulesets. The PARAS model is
based on the notion of stable-region abstractions that form the
coarse-granularity ruleset space. Based on new insights into the
redundancy relationships among rules, PARAS establishes a surprisingly
compact representation of complex redundancy relationships
while enabling efficient redundancy resolution at query time.
Besides classical rule mining requests, the PARAS model supports
three novel classes of exploratory queries. Using the proposed
PSpace index, these exploratory query classes can all be answered
with near-real-time responsiveness. Our experimental evaluation
using several benchmark datasets demonstrates that PARAS
achieves 2 to 5 orders of magnitude improvement over state-of-the-art
approaches in online association rule mining.
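The intuition behind stable regions can be seen in miniature: each rule has fixed (support, confidence) coordinates, so as the thresholds vary, the answer ruleset only changes when a threshold crosses one of those finitely many coordinates. The tiny dataset and code below are illustrative only, not the PSpace index itself.

```python
from itertools import combinations

transactions = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Each rule X -> Y gets fixed (support, confidence) coordinates once,
# independent of any threshold the user later picks.
rules = {}
for x, y in combinations({"a", "b", "c"}, 2):
    for lhs, rhs in [({x}, {y}), ({y}, {x})]:
        s = support(lhs | rhs)
        if s > 0:
            rules[(frozenset(lhs), frozenset(rhs))] = (s, s / support(lhs))

def query(min_sup, min_conf):
    return {r for r, (s, c) in rules.items() if s >= min_sup and c >= min_conf}

# Two threshold pairs inside the same stable region return the same ruleset:
assert query(0.3, 0.5) == query(0.45, 0.55)
```

Precomputing one ruleset per stable region is what lets threshold-varying exploratory queries be answered by lookup rather than by re-mining.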
|
Scalable data stream processing
|
|
QueryMesh: A Novel Paradigm for Query Processing
Read more...
Technological advances in positioning, sensing, and monitoring drive data acquisition devices to generate massive streams of data.
The goal of this research is to develop a new class of high-performance stream data management systems capable of coping with
infinite data arriving in large volumes under near-real-time response requirements. The proposed query processing paradigm, termed the
multi-route query mesh model (QM), overcomes a major limitation of current query optimizers, both static and streaming alike, namely the
assignment of a single 'best' query execution plan to all input data. That approach, resting on the strong assumption of data uniformity,
results in substandard performance for possibly all data items. Instead, query mesh adopts a processing structure composed of a data classifier
and a multiple-route plan infrastructure. Different learning models can be plugged into the QM model as classifier logic. Given the complexity
of the QM solution space, cost-based search heuristics are designed to efficiently find high-quality query meshes. QM is adaptive, supporting the
detection and incremental modification of the QM classifier and its routes. The intellectual merit lies in the design, development, and evaluation of
a novel multi-route paradigm for stream query processing -- a middle ground between the two current extremes of single-plan and route-less
solutions. Experimental studies compare query mesh to state-of-the-art solutions. QM impacts society by facilitating a wide range of stream-centric
applications, including medical outpatient monitoring, emergency management, and business intelligence processing, and by integrating project activities with education.
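The classifier-plus-routes structure can be sketched as follows: a classifier assigns each tuple to one of several precomputed operator orderings rather than forcing one global plan. The predicates, routes, and classifier here are toy stand-ins, not QM's learned models or cost-based search.

```python
def is_local(order):        # cheap and highly selective for local traffic
    return order["region"] == "local"

def is_priority(order):
    return order["priority"]

# A route is an operator pipeline (ordered list of predicates). A real QM
# would choose orderings so the most selective operator runs first per class.
routes = {
    "local_first": [is_local, is_priority],
    "priority_first": [is_priority, is_local],
}

def classifier(order):
    # stand-in for a pluggable learned model
    return "local_first" if order["region"] == "local" else "priority_first"

def process(order):
    for op in routes[classifier(order)]:
        if not op(order):
            return None       # tuple filtered out as early as possible
    return order              # tuple survived every operator on its route

stream = [{"region": "local", "priority": True},
          {"region": "remote", "priority": False}]
survivors = [o for o in stream if process(o)]
# only the local, priority order passes both predicates
```

Running the cheapest discriminating operator first on each class is where the per-class routes beat any single shared ordering.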
|
|
RAINDROP: XQueries Over XML Streams (Automaton Meets Algebra)
Read more...
As XML becomes popular, more and more stream sources exist in the XML format. Typical XML stream applications include XML message
brokers for B2B message-oriented middleware servers and the selective dissemination of information, such as personalized newspaper delivery.
The general goal of Raindrop is to tackle the challenges of stream processing that are specific to XML, in particular processing XQuery, a
standard XML query language, over XML streams.
It is important to note that unlike tuple-based or object-based data streams, XML streams are more appropriately modeled as a sequence
of primitive tokens, such as a start tag, an end tag, or a PCDATA item. A tuple is a self-contained structure whose
semantics are completely determined by its own values. A token, on the other hand, lacks semantics
without the context provided by other tokens in the stream. Structural pattern retrieval, one of the three functionalities of an XQuery
(the other two being filtering and restructuring), must first be performed on these non-self-contained tokens to compose self-contained objects.
While the automaton model is naturally suited for pattern matching on tokenized XML streams, the algebraic model in contrast is a
well-established technique in database systems for set-oriented processing of self-contained data units, i.e., tuples. However,
neither model alone is well equipped to handle both computation paradigms. The goal of the Raindrop project
is to accommodate these two paradigms within one uniform algebraic framework, thus taking advantage of both. In our query model,
both tokenized data and self-contained tuples are supported in a uniform manner. Query plans can thus be flexibly rewritten
using algebra-like equivalence rules to change what computation is done on tokenized data versus on tuples. The Raindrop system has four
levels of abstraction in its framework, namely the semantics-focused plan, the stream logical plan, the stream physical plan, and the execution plan.
Various optimization techniques are provided at each level.
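The token-versus-tuple distinction can be shown in miniature: a pattern-retrieval step must buffer non-self-contained tokens until a matching element closes, and only then can it emit a self-contained tuple. This is a toy tokenizer for illustration; Raindrop's actual operators are algebra plans.

```python
def tuples_for(stream, tag):
    """Compose self-contained (tag, text) tuples from a stream of
    SAX-like tokens of the form ("start"|"text"|"end", value)."""
    depth, buf = 0, []
    for kind, value in stream:
        if kind == "start" and value == tag:
            depth, buf = depth + 1, []     # open a new matching element
        elif kind == "text" and depth:
            buf.append(value)              # buffer: a token alone means nothing
        elif kind == "end" and value == tag and depth:
            depth -= 1
            yield (tag, "".join(buf))      # now the unit is self-contained

tokens = [("start", "item"), ("text", "42"), ("end", "item"),
          ("start", "other"), ("text", "x"), ("end", "other")]
print(list(tuples_for(tokens, "item")))  # [('item', '42')]
```

Everything downstream of such an operator can then use ordinary set-oriented algebraic processing on tuples, which is precisely the plan-rewriting flexibility the uniform framework provides.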
|
|
CAPE: Continuous Adaptive Processing Engine
Read more...
The growth of electronic commerce and the widespread use of sensor networks have created demand for online processing and monitoring applications,
creating a new class of query processing over continuously generated data streams. Traditional database techniques, which assume data to be bounded as
well as statically stored and indexed, are largely incapable of handling these new applications, and so Continuous Query (CQ) systems have emerged.
CQ systems must be adaptive to properly manage their available resources in the face of data streams with widely varying arrival rates and a constantly
changing set of standing user queries. No a priori optimization algorithm can succeed given such variability.
The CAPE project aims to propose a novel architecture for a CQ system that (1) incorporates adaptability at all levels of query processing
and (2) incorporates a dynamic metadata model used to help optimize all levels of query processing.
The CAPE project also aims to provide novel techniques for processing large numbers of concurrent continuous queries with the required Quality of Service (QoS).
Because of the dynamic nature of query registration and stream behavior, we are designing heterogeneous-grained adaptivity for CAPE that exploits dynamic
metadata at all levels of continuous query processing, including query operator execution, memory allocation, operator scheduling, query plan structuring,
and query plan distribution among multiple machines. We will (1) design an extensible dynamic metadata model; (2) design adaptive algorithms for use at each
layer of query processing to exploit available metadata; (3) develop QoS specification models for capturing resource usage; (4) incorporate a hierarchical
interaction model for coordinating adaptation at different levels within the CQ system; and (5) design a family of metadata-exploiting optimization techniques.
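One concrete instance of metadata-driven adaptivity is an operator scheduler that consults runtime statistics each round instead of following a fixed schedule, sketched below. The operators, batch size, and queue-length heuristic are invented for illustration; CAPE's adaptivity spans several more layers.

```python
class Operator:
    def __init__(self, name):
        self.name, self.queue = name, []

    def run_batch(self, n=2):
        # process up to n queued tuples, return what was consumed
        done, self.queue = self.queue[:n], self.queue[n:]
        return done

def schedule_round(operators):
    # dynamic metadata: current queue lengths; greedily relieve backpressure
    busiest = max(operators, key=lambda op: len(op.queue))
    return busiest.name, busiest.run_batch()

sel, join = Operator("select"), Operator("join")
sel.queue = [1, 2]
join.queue = [3, 4, 5, 6]
name, _ = schedule_round([sel, join])
# the join operator, with the longer queue, is scheduled first
```

Because the decision is re-evaluated every round against live metadata, the schedule adapts automatically as arrival rates shift between streams, with no a priori plan required.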
|
Undergraduate projects
|
|
Interactive web-based dashboard for the Massachusetts High Tech Council
Read more...
The goal of this project is to build an interactive web-based dashboard for the Massachusetts High Tech Council,
a pro-technology advocacy and lobbying organization, and to analyze its effects. We conducted a survey of Massachusetts High Technology Council (MHTC) members about the
perceived effectiveness of the dashboard, as well as a usability study of the dashboard prototype to test its ease of use. This allowed us to
better understand the impact of technology on policy making.
Students: Nilesh C Patel, Stefan Gvozdenovic, Theodore J Meyer
|