Class Projects
Spring 2007
- Mining Pharmacogenomics Information using Topical Web Crawlers
- Pagerank & Sample Size
- Can we beat Cinematch (Netflix Recommendation System)?
- SMILES Index for Retrieving Chemical Information
- Social network community emergence around new digital media
- Trend Prediction
- Evaluating Hypertext Documents for Authenticity
Spring 2006
- Usage Statistics of Robots Exclusion Standard (paper in Proc. IADIS WWW/Internet 2006)
- Mining for Blog communities
- Directed News Analysis
- KidsCrawler
- Ontology Generation from Specialized Corpora
- Using Online Social Networks in Topical Sentiment Analysis
- Using Page History to Rank Search Results
- Web Mining Developmental Trends in Social Networks
- Web Topology of the Indiana University Domain (paper in Proc. IV07)
- Web user profiling and its applicability to system security
Spring 2005
- Multilingual news search
- Effects of guided summarization on QA using the Web
- Mining people connections
- Personalized search by history context
- Phishing Attacks Using Social Networks (see coverage in IDS, IDS, and Slashdot; paper to appear in CACM)
- Structural evolution of Web content
- Clustering of political opinion sites using unsupervised techniques
Spring 2004 (sample)
- Experiments with PageRank Computation
- Clustering Weblogs using LSA and Link-based Methods
- Sherlock News Search Engine
- Focused Crawlers vs Accelerated Focused Crawlers
- Domain-Based PageRank Personalization (paper presented at WebKDD 2004!)
Ideas for Future Projects
- Wikipedia's publicly available data set provides a wealth of information
about the structure and evolution of a dynamic, socially edited forum.
Possible research includes automated fact-checking and user authority measures
(date of registration, number of posts, average size and longevity of posts, etc.).
Another avenue of investigation would be correlating IP addresses / usernames
with changes to specific hot-topic issues. An automated method to detect
suspicious edits / revisions could help flag biased or self-serving
modifications and, in turn, identify the individuals / organizations
responsible. (Submitted by Mike Conover, based on discussion in class and
prior conversations with Virgil Griffith)
- Create a meta-search engine that finds potentially
"embarrassing" personal material (photos, videos, text, etc.)
by mining various sources such as search engines, social network sites,
photo and video sites, etc. This could serve an educational purpose by
highlighting the dangers of posting private information on the Web.
(Based on an idea by Mike Conover)
- Develop a desktop application (e.g., for the Google Desktop, the
Yahoo Desktop, Windows Desktop, or Apple Dashboard Widget) that
implements some (simplified/extended) version of the HITS algorithm
discussed in class. This would be a client-based solution for
query-time analysis. (Inspired by Michel Salim)
- The economics of Google Ads are inducing (creating incentives for)
a "pollution" of information on the Web. People create fake
"original content" with popular query terms to attract traffic and
make a profit through advertising. This is being done both manually (by
underpaid hired writers) and with automatic text generation scripts.
Can we devise techniques to clean the Web of such pollution?
(Submitted by Fil)
- Try to come up with a list of the top-X sites frequented by spam
harvesters. For example, I created a Gmail account to submit a script to
CPAN, which posts authors' email addresses on its site. That account has
quickly become a honeypot with thousands of spam messages. So clearly
spammers harvest email addresses from CPAN. What other sites? What are the
worst? You could write a crawler that automatically posts email addresses to
sites it encounters (message boards, etc.) and then monitors which sites
generate the most spam. (Submitted by Fil)
- A machine learning method to classify an arbitrary Web page as blog
or not blog, for crawling purposes. (Submitted by Alex Breuer)
- A text mining algorithm to find a huge set of triples (email
address, name, address) from crawls of personal websites, blogs, bios,
etc. and cross-reference them with structured databases such as IP-to-ZIP
converters, phone books, and other online resources.
This problem is one of the holy grails in phishing right now, and nobody
has addressed how it can be done or protected against. (Submitted by
Markus Jakobsson, who could act as external co-supervisor)
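For the HITS idea above, the core iteration can be sketched in a few lines of Python. This is only a minimal illustration, not the version covered in class or a required design: the function name `hits`, the dictionary-based graph format, and the fixed iteration count are all assumptions made for the example.

```python
import math


def hits(graph, iterations=50):
    """Return (hubs, authorities) for a link graph.

    graph is a hypothetical input format: a dict mapping each page
    to the list of pages it links to.
    """
    # Collect every page that appears as a source or a link target.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    hubs = {n: 1.0 for n in nodes}
    auths = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        new_auths = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_auths[dst] += hubs[src]

        # Hub score: sum of authority scores of the pages linked to.
        new_hubs = {
            n: sum(new_auths[dst] for dst in graph.get(n, []))
            for n in nodes
        }

        # Normalize both vectors to unit length so scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auths.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hubs.values())) or 1.0
        auths = {n: v / a_norm for n, v in new_auths.items()}
        hubs = {n: v / h_norm for n, v in new_hubs.items()}

    return hubs, auths


# Toy graph: pages "a" and "b" both link to page "c".
hubs, auths = hits({"a": ["c"], "b": ["c"], "c": []})
```

In a client-side setting, the graph would presumably be built at query time from the top search results and their neighbors; how to fetch and bound that neighborhood is left open here.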
Project Proposal Format
The project proposal is free-format but cannot be longer than one page.
Use your best judgement about margins and font size (fitting
too much on a page would be a bad idea!). The proposal should be concise,
concrete, focused and to the point. It should answer a few basic questions:
- Why? (Motivate your idea; is it interesting, important, relevant?)
- What? (Exactly what do you propose?)
- How? (State your hypothesis and evaluation procedure)
- When? (You need a realistic timetable and deliverable; is it doable?)