Invited Talks
Algorithms for data mining: theory meets practice (in a chat room)
Abstract
In this talk I will survey some of the work going on in
Microsoft Research in the areas of machine learning and data
mining. In particular, I will talk about privacy in public
databases, statistical methods for data cleaning, and
algorithms for large-scale spectral computations. I will also
talk about our experiences with an artifact that raises
concrete, fascinating questions in each of these three areas:
the MSN Messenger "buddies network", consisting of
approximately 100 million users and 2 billion edges.
Speaker: Dimitris Achlioptas (Microsoft Research, Redmond)
Dimitris Achlioptas received his B.Eng. in Computer
Engineering from the University of Patras in 1993 and his
M.Sc. and Ph.D. in Computer Science from the University of
Toronto in 1995 and 1999. He subsequently joined Microsoft
Research as a postdoctoral fellow, where he has been a
research staff member since 2000. His research interests are
centered around the interaction of random structures with
computation. He has published in AAAI, FOCS, IJCAI, NIPS,
PODS, SODA, STOC and other conferences, as well as JACM, JAMS,
JCSS, SICOMP and other journals. He has served as program
committee member for AAAI, FOCS, ICML, ICDM, RANDOM, SAT,
SODA, WAW and other conferences. His recent work has included
the analysis of large random networks and the use of
randomization to accelerate algorithms in machine learning,
information retrieval, and constraint satisfaction.
Home page: http://research.microsoft.com/~optas
Privacy and Data Mining
Abstract
There is increasing need to build information systems that protect the privacy and ownership of data without impeding the
flow of information. We will present some of our current work to demonstrate the technical feasibility of building such systems:
Privacy-preserving data mining. The conventional wisdom held that data mining and privacy were adversaries, and the use of data mining must be restricted
to protect privacy. Privacy-preserving data mining cleverly finesses this conflict by exploiting the difference between the
level where we care about privacy, i.e., individual data, and the level where we run data mining algorithms, i.e., aggregated
data. User data is randomized such that it is impossible to recover anything meaningful at the individual level, while still
allowing the data mining algorithms to recover aggregate information, build mining models, and provide actionable insights.
Hippocratic databases. Unlike current systems, Hippocratic databases include responsibility for the privacy of the data they manage as a founding
tenet. Their core capabilities have been distilled from the principles behind current privacy legislation and guidelines.
We identify the technical challenges and problems in designing Hippocratic databases, and also outline some solutions.
Sovereign information sharing. Current information integration approaches are based on the assumption that the data in each database can be revealed completely
to the other databases. Trends such as end-to-end integration, outsourcing, and security are creating the need for integrating
information across autonomous entities. In such cases, the enterprises do not wish to completely reveal their data. In fact,
they would like to reveal minimal information apart from the answer to the query. We have formalized the problem, identified
key operations, and designed algorithms for these operations, thereby enabling a new class of applications, including information
exchange between security agencies, intellectual property licensing, crime prevention, and medical research.
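The randomization idea in the privacy-preserving data mining item above can be illustrated with a minimal sketch. It assumes one simple scheme, additive zero-mean noise applied by each user before the data is collected; this is an illustration only and not necessarily the method presented in the talk. The function names, the age range, and the noise spread are hypothetical.

```python
import random


def randomize(values, spread, rng):
    """Perturb each value with independent zero-mean uniform noise in [-spread, spread]."""
    return [v + rng.uniform(-spread, spread) for v in values]


def estimate_mean(perturbed):
    """Zero-mean noise averages out, so the sample mean of the perturbed data estimates the true mean."""
    return sum(perturbed) / len(perturbed)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical sensitive attribute: the ages of 100,000 users.
true_ages = [rng.randint(20, 60) for _ in range(100_000)]

# Only the perturbed values leave the users' hands. The noise range (assumed
# here to be +/-50) is wider than the data range, so a single perturbed value
# says little about the age behind it.
perturbed = randomize(true_ages, spread=50.0, rng=rng)

true_mean = sum(true_ages) / len(true_ages)
est_mean = estimate_mean(perturbed)
print(f"true mean {true_mean:.2f}, estimate from randomized data {est_mean:.2f}")
```

With this many users the two means agree closely, even though each reported value may be off by up to 50 years. Recovering a full distribution rather than a single mean needs a more careful reconstruction step, which this sketch omits.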
Speaker: Rakesh Agrawal (IBM Almaden Research Center)
Rakesh Agrawal is an IBM Fellow, whose current research
interests include privacy technologies for data systems, web
technologies, data mining and OLAP. He leads the Intelligent
Information Systems Research (aka Quest project) at the IBM
Almaden Research Center, which pioneered key data mining
concepts and technologies. He has published more than 100
research papers and he has been granted 50 patents. He is the
recipient of the ACM-SIGKDD First Innovation Award, ACM-SIGMOD
2000 Innovations Award, as well as the ACM-SIGMOD 2003 Test of
Time Award. He was recently selected as one of the 2003
Scientific American 50, which recognized singular
accomplishments of those who have contributed to the advancement
of technology in the realms of science, engineering, commerce
and public policy. He was singled out for devising methods to
preserve the privacy of information in large databases. He is
also a Fellow of IEEE and a Fellow of ACM.
Rakesh Agrawal received the M.S. and Ph.D. degrees in
Computer Science from the University of Wisconsin-Madison in
1983. He also has a B.E. degree in Electronics and Communication
Engineering from the University of Roorkee, and a two-year Post
Graduate Diploma in Industrial Engineering from the National
Institute of Industrial Engineering (NITIE), Bombay. Prior to
joining IBM Almaden in 1990, he was with Bell Laboratories,
Murray Hill from 1983 to 1989.
Home page: http://www.almaden.ibm.com/u/ragrawal/
Breaking through the syntax barrier: Searching with entities and relations
Abstract
The next wave in search technology will be driven by the
identification, extraction, and exploitation of real-world
entities represented in unstructured textual sources. Search
systems will either let users express information needs
naturally and analyze them more intelligently, or allow simple
enhancements that add more user control on the search process.
The data model will exploit graph structure where available, but
not impose structure by fiat. First generation Web search,
which uses graph information at the macroscopic level of
inter-page hyperlinks, will be enhanced to use fine-grained
graph models involving page regions, tables, sentences, phrases,
and real-world-entities. New algorithms will combine
probabilistic evidence from diverse features to produce
responses that are not URLs or pages, but entities and their
relationships, or explanations of how multiple entities are
related.
Speaker: Soumen Chakrabarti (Indian Institute of Technology, Bombay)
Soumen Chakrabarti received his B.Tech in Computer Science
from the Indian Institute of Technology, Kharagpur, in 1991 and
his M.S. and Ph.D. in Computer Science from the University of
California, Berkeley in 1992 and 1996. At Berkeley he worked on
compilers and runtime systems for running scalable parallel
scientific software on message passing multiprocessors.
He was a Research Staff Member at IBM Almaden Research Center
from 1996 to 1999, where he worked on the Clever Web search
project and led the Focused Crawling project.
In 1999 he moved as Assistant Professor to the Department of
Computer Science and Engineering at the Indian Institute of
Technology, Bombay, where he has been an Associate Professor
since 2003. In Spring 2004 he is Visiting Associate Professor at
Carnegie Mellon University.
He has published in the WWW, SIGIR, SIGKDD, SIGMOD, VLDB,
ICDE, SODA, STOC, SPAA and other conferences as well as
Scientific American, IEEE Computer, VLDB and other journals. He
holds eight US patents on Web-related inventions. He has served
as vice-chair or program committee member for WWW, SIGIR,
SIGKDD, VLDB, ICDE, SODA and other conferences, and guest editor
or editorial board member for DMKD and TKDE journals. He is also
the author of a new book on Web Mining.
His current research interests include question answering,
Web analysis, monitoring and search, mining irregular and
relational data, and textual data integration.
Home page: www.cse.iitb.ac.in/~soumen
Real-World Learning With Markov Logic Networks
Abstract
Machine learning and data mining systems
have achieved many impressive successes, but to become truly
widespread they must be able to work with less help from
people. This requires automating the data cleaning and
integration process, handling multiple types of objects and
relations at once, and easily incorporating domain knowledge.
In this talk, I will describe how we are pursuing these aims
using Markov logic networks, a representation that combines
first-order logic and probabilistic graphical models. Data from
multiple sources is integrated by automatically learning
mappings between the objects and terms in them. Rich relational
structure is learned using a combination of ILP and statistical
techniques. Knowledge is incorporated by viewing logic
statements as soft constraints on the models to be
learned. Application to a real-world university domain shows our
approach to be accurate, efficient, and less labor-intensive
than traditional ones.
(Joint work with Parag and Matt Richardson.)
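To make the phrase "logic statements as soft constraints" concrete, here is a minimal sketch of a Markov logic network with a single weighted formula and a single constant. The predicate names, the weight of 1.5, and the brute-force enumeration are illustrative assumptions, not the system described in the talk. Each possible world gets probability proportional to exp(sum of the weights of the formulas it satisfies), so a world that violates the rule becomes less likely rather than impossible.

```python
import itertools
import math

# Ground atoms for one hypothetical constant, A.
ATOMS = ["Smokes(A)", "Cancer(A)"]


def smoking_implies_cancer(world):
    """Ground formula Smokes(A) => Cancer(A)."""
    return (not world["Smokes(A)"]) or world["Cancer(A)"]


# (weight, formula) pairs; the weight 1.5 is an arbitrary illustrative choice.
FORMULAS = [(1.5, smoking_implies_cancer)]


def unnormalized(world):
    """exp of the total weight of satisfied formulas (one grounding per formula here)."""
    return math.exp(sum(w for w, formula in FORMULAS if formula(world)))


def worlds():
    """Enumerate every truth assignment to the ground atoms."""
    for values in itertools.product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))


def conditional(query, evidence):
    """P(query is true | evidence), by brute-force enumeration over all worlds."""
    num = den = 0.0
    for world in worlds():
        if all(world[atom] == value for atom, value in evidence.items()):
            weight = unnormalized(world)
            den += weight
            if world[query]:
                num += weight
    return num / den


p = conditional("Cancer(A)", {"Smokes(A)": True})
print(f"P(Cancer(A) | Smokes(A)) = {p:.3f}")
```

With the weight set to 1.5 the conditional probability is about 0.82; as the weight grows it approaches 1, recovering the hard logical implication. Enumeration is only feasible for toy domains, and practical systems rely on approximate inference.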
Speaker: Pedro Domingos (University of Washington, Seattle)
Pedro Domingos is an assistant professor in the Department of
Computer Science and Engineering at the University of
Washington. His research interests are in artificial
intelligence, machine learning and data mining. He received a
PhD in Information and Computer Science from the University of
California at Irvine, and is the author or co-author of over 100
technical publications. He is associate editor of JAIR, a member
of the editorial board of the Machine Learning journal, and a
co-founder of the International Machine Learning Society. He was
program co-chair of KDD-2003, and has served on numerous program
committees. He has received several awards, including a Sloan
Fellowship, an NSF CAREER Award, a Fulbright Scholarship, an IBM
Faculty Award, and best paper awards at KDD-98 and KDD-99.
Home page: http://www.cs.washington.edu/homes/pedrod
Strength in diversity: the advance of data analysis
Abstract
Although the origins can be traced back as far as one likes,
the proper scientific analysis of data is really only around a
century old. For most of that century, data analysis was the
realm of only one discipline - statistics. In recent decades,
however, as a consequence of the development of the computer,
things have changed dramatically and now there are several such
disciplines, including machine learning, pattern recognition,
and data mining. Although all of these disciplines are
concerned with extracting information from data, they have
subtle differences in aims and emphasis. This paper looks at
some of the similarities and some of the differences, noting
where the disciplines intersect and, perhaps of more interest,
where they do not. Particular issues examined include the
nature of the data with which they are concerned, the role of
mathematics, differences in the objectives, how the different
areas of application have led to different aims, and how the
different disciplines have led sometimes to the same analytic
tools being developed, but also sometimes to different tools
being developed. Some conjectures about likely future
developments are given.
Speaker: David Hand (Imperial College, London)
David Hand is Professor of Statistics and Head of the
Statistics Section at Imperial College London. He has published
twenty books on statistics and related areas, including
Discrimination and Classification, Analysis of
Repeated Measures, Practical Longitudinal Data
Analysis, Construction and Assessment of Classification
Rules, Intelligent Data Analysis, Statistics
in Finance, and Principles of Data Mining. He is
a Fellow of the Royal Statistical Society and an Honorary Fellow
of the Institute of Actuaries. He launched the journal
Statistics and Computing in 1991, and also served a
term of office as editor of Journal of the Royal Statistical
Society, Series C. He was awarded the Thomas L. Saaty Prize
for Applied Advances in the Mathematical and Management Sciences
in 2001 and the Royal Statistical Society’s Guy Medal in Silver
in 2002, and was elected Fellow of the British Academy, the UK’s
leading learned society for the humanities and social sciences,
in 2003. His research interests include classification methods,
the fundamentals of statistics, and data mining, and his
applications interests include medicine and finance. He has
acted as a consultant to a wide range of organizations,
including governments, banks, pharmaceutical companies,
manufacturing industry, and health service providers.
Home page: http://stats.ma.imperial.ac.uk/~djhand/