Table of contents
  1. Story
  2. Slides
  3. Spotfire Dashboard
  4. Research Notes
  5. KDD Cup
    1. KDD 2013: Determine whether an author has written a given paper
      1. Description
      2. Evaluation
      3. Competition Rules
      4. Prizes
      5. Frequently Asked Questions
      6. Organizers
        1. Microsoft Research
          1. Vani Mandava
        2. University of Washington
          1. Senjuti Basu Roy, Ph.D.
          2. Swapna Savvana
        3. Ghent University
          1. Martine De Cock, Ph.D.
        4. KDD Cup Chairs
          1. Claudia Perlich, Ph.D.
          2. Brian Dalessandro
          3. Ben Hamner
          4. William Cukierski, Ph.D.
      7. Preliminary Winners
      8. Timeline
      9. Data Files
    2. KDD Cup 2012: Predict which users (or information sources) one user might follow in Tencent Weibo
      1. Background
      2. Task 1 Description
      3. Data Sets
      4. Evaluation
      5. Prizes
      6. Data Files
    3. KDD Cup 2012: Predict the click-through rate of ads given the query and user information
      1. Task 2 Description
      2. Training Data File
      3. Additional Data Files
      4. Testing Data Set
      5. Evaluation
      6. Prizes
      7. Data Files
    4. KDD Cup 2011: The Yahoo! Music Dataset and KDD-Cup'11
      1. Abstract
      2. 1. Introduction
      3. 2. The Yahoo! Music Dataset
        1. 2.1. Yahoo! Music Radio service
        2. 2.2. Ratings dataset
          1. Figure 1: Power-law distribution of the number of ratings per user and per item
          2. Figure 2: Rating distribution in the training and validation datasets
          3. Figure 3: Item and user rating characteristics
          4. Figure 4: Temporal statistics. Time is measured in weeks since the first rating
          5. Table 1: Datasets size statistics
      4. 4. Contest conduct
          1. Figure 5: Submissions per day
          2. Figure 6: Leadership and Performance over time
          3. Table 2: Submissions statistics: Number of submissions per team
          4. Table 3: The winning teams and their performances on Test1 and Test2.
      5. 5. Lessons
          1. Figure 7: Histogram of the final submissions' performance
        1. 5.1. Maneuvers by the contestants
      6. References
        1. 1
        2. 2
        3. 3
        4. 4
        5. 5
        6. 6
        7. 7
        8. 8
        9. 9
        10. 10
        11. 11
        12. 12
        13. 13
        14. 14
    5. KDD Cup 2010: Student performance evaluation
      1. This Year's Challenge
      2. Task Description
      3. Competition Rules
      4. Data Download
      5. Results
      6. Contacts
    6. KDD Cup 2009: Customer relationship prediction
      1. This Year's Challenge
      2. Task Description
      3. Competition Rules
      4. Data Download
      5. Results
      6. FAQs
      7. Contacts
    7. KDD Cup 2008: Breast cancer
      1. This Year's Challenge
      2. Task Description
      3. Data Download
      4. Results
      5. FAQs
      6. Contacts
    8. KDD Cup 2007: Consumer recommendations
      1. This Year's Challenge
      2. Task Description
      3. Competition Rules
      4. Data Download
      5. Results
      6. Frequently Asked Questions
      7. Contacts
    9. KDD Cup 2006: Pulmonary embolisms detection from image data
      1. This Year's Challenge
      2. Task Description
      3. Competition Rules
      4. Data Download
      5. Results
      6. Frequently Asked Questions
      7. Contacts
    10. KDD Cup 2005: Internet user search query categorization
      1. This Year's Challenge
      2. Task Description
      3. Competition Rules
      4. Data
      5. Results
      6. Frequently Asked Questions
      7. Contacts
    11. KDD Cup 2004: Particle physics; plus protein homology prediction
      1. This Year's Challenge
      2. Tasks
      3. Competition Rules
      4. Data Download
      5. Results
      6. Frequently Asked Questions
      7. Contacts
    12. KDD Cup 2003: Network mining and usage log analysis
      1. This Year's Challenge
      2. Tasks
      3. Competition Rules
      4. Data
      5. Results
      6. General Questions
      7. Contacts
    13. KDD Cup 2002: BioMed document; plus gene role classification
      1. This Year's Challenge
      2. Tasks
      3. Data Download
      4. Results
      5. General FAQs
      6. Contacts
    14. KDD Cup 2001: Molecular bioactivity; plus protein locale prediction
      1. This Year's Challenge
      2. Task Description
      3. Data Download
      4. Results
      5. Frequently Asked Questions
      6. Contacts
    15. KDD Cup 2000: Online retailer website clickstream analysis
      1. This Year's Challenge
      2. Tasks
      3. Data Download
      4. Results
      5. Contacts
    16. KDD Cup 1999: Computer network intrusion detection
      1. This Year's Challenge
      2. Task Description
      3. Data Download
      4. Results
      5. Contacts
    17. KDD Cup 1998: Direct marketing for profit optimization
      1. This Year's Challenge
      2. Task Description
      3. Evaluation
      4. Data Download
      5. Results
      6. Contacts
    18. KDD Cup 1997: Direct marketing for lift curve optimization
      1. KDD Cup 1998 Data
        1. Abstract
        2. Usage Notes
  6. NEXT

KDD Cup

Story

KDD Cup Datasets: EDA for Data Science

KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miners. Kaggle assumed leadership of the KDD Cup in 2011.

 
In Doing Data Science, Chapter 7, in the sub-section entitled “Data Science Competitions Cut Out All the Messy Stuff,” it says:

"Competitions might be seen as formulaic, dry, and synthetic compared to what you’ve encountered in normal life. Competitions cut out the messy stuff before you start building models—asking good questions, collecting and cleaning the data, etc.—as well as what happens once you have your model, including visualization and communication. The team of Kaggle data scientists actually spends a lot of time creating the dataset and evaluation metrics, and figuring out what questions to ask, so the question is: while they’re doing data science, are the contestants?"

In Chapter 13, entitled “Lessons Learned from Data Competitions: Data Leakage and Model Evaluation,” it says:

"Claudia Perlich is a famously successful data mining competition winner. She won the KDD Cup in 2003, 2007, 2008, and 2009,

Claudia drew a distinction between different types of data mining competitions. The first is the “sterile” kind, where you’re given a clean, prepared data matrix, a standard error measure, and features that are often anonymized. This is a pure machine learning problem.

On the other hand, you have the “real world” kind of data mining competition, where you’re handed raw data (which is often in lots of different tables and not easily joined), you set up the model yourself, and come up with task-specific evaluations. This kind of competition simulates real life more closely. You need practice dealing with messiness.

Claudia Perlich likes to understand something about the world by looking directly at the data.

Examples of this second kind are KDD cup 2007, 2008, and 2010. If you’re in this kind of competition, your approach would involve understanding the domain, analyzing the data, and building the model. The winner might be the person who best understands how to tailor the model to the actual question.

Claudia prefers the second kind, because it’s closer to what you do in real life." END OF BOOK EXCERPT

I also like to look directly at the data and use that to help me decide what to do with it. My audit results are summarized in the table below.

Year | Title | Data Set | Size / Comment | Spotfire File and Web Player | Rows | Columns | Comment
1997 | Direct marketing for lift curve optimization | KDD Cup 1998 Data | Same as 1998; 231 MB | File (88 MB) and Web Player | 288,146 | 3, 479, & 481 | Learning and Validation
1998 | Direct marketing for profit optimization | Data Download | Same as 1997; 231 MB | File (88 MB) and Web Player | 288,146 | 3, 479, & 481 | Learning and Validation
1999 | Computer network intrusion detection | Data Download | 1.4 GB | File (26 MB) and Web Player | | | 5 datasets
2000 | Online retailer website clickstream analysis | Data Download | | | | | .DAT files
2001 | Molecular bioactivity; plus protein locale prediction | Data Download | | | | | DATA files; see slides
2002 | BioMed document; plus gene role classification | Data Download | 127 MB | File (0.6 MB) and Web Player | 80,460 | 1-5 | 13 files and 15,234 Abstract Files. Note: training and test data for this task are no longer available due to copyright concerns. See slides?
2003 | Network mining and usage log analysis | Data | | | | | TAR files
2004 | Particle physics; plus protein homology prediction | Data Download | | | | | TAR file
2005 | Internet user search query categorization | Data | | | | | KDD Cup 2005 website page not found
2006 | Pulmonary embolisms detection from image data | Data Download | | | | | TAR and DAT files
2007 | Consumer recommendations | Data Download | Two Text Files | | | | Netflix Prize training dataset (no longer available)
2008 | Breast cancer | Data Download | 410 MB | File (170 MB) and Web Player | 488,786 | 3-117 | 5 files; see ReadMe for explanation
2009 | Customer relationship prediction | Data Download | | | | | .DAT files
2010 | Student performance evaluation | Data Download | 10 GB | File (1.5 GB) and Web Player | 52,513,368 | 2-23 | 18 datasets: master, test, submission & train
2011 | The Yahoo! Music Dataset | Ratings dataset | | | | | Not Found
2012 | Track 2: Predict the click-through rate of ads given the query and user information | Data Files | 4.3 GB | File (??GB) and Web Player | 220,322,644 | 2-12 | 7 datasets
2012 | Track 1: Predict which users (or information sources) one user might follow in Tencent Weibo | Data Files | 1.1 GB | File (195 MB) and Web Player | 38,931,318 | 2-4 | 4 datasets
2013 | Determine whether an author has written a given paper | Data Files (Kaggle) | 871 MB | File (483 MB) and Web Player | 15,461,646 | 2-6 | 14 datasets: 7 test and 7 full
2014 | | | | | | |

This way one does not have to guess what is in the ZIP files, log in to download them, and then unzip them; one can see the data and decide whether it can be used in Spotfire. In this audit I discovered two file formats I had heard about but not really worked with until now, .DAT and .TAR, so I did some research on the Web and found the following explanations useful.

What Is DAT File Format?
By Michael Wolfe, eHow Contributor
  
The file extension "DAT" stands for "data."
The file extension ".dat," when it appears at the end of a file name, stands for "data." Among the most common file extensions in software, the DAT file lacks any standardized structure and may have been created by almost any program.

Features

  • According to the website FileInfo.com, .dat files are generic data files created and used by software applications. The information inside of the file usually is encoded in a binary or text format. Because the files lack a standardized structure, usually they can be properly used only by the application that created them.

Considerations

  • DAT files are created by a number of common programs, including Microsoft Internet Explorer, Nullsoft Winamp, Corel WordPerfect, Runtime GetDataBack and Pitney Bowes Mapinfo. However, although the file may be used by the program, it doesn't mean necessarily that it can be directly opened by the application. DAT files are supported as native file types by few applications.

Solution

  • It is often difficult to determine what a DAT file contains or by which application it was created. To determine where the file came from and what it contains, a user can open the file with a text editor, such as EditPad, and examine its content. END OF STORY

What is a tar file and how do I open a tar file?

TAR, or tape archive, is both a file format and the name of the program used to handle such files. TAR is used to collect many files into one larger file for distribution or archiving, while preserving file system information such as user and group permissions, dates and directory structures. Source code is often packed for download as a TAR, or Tape ARchive file, the standard format in Unix/Linux. 7-Zip and TarTool can be used to unpack them. END OF STORY

I used 7-Zip. Its main features are a powerful file manager, a high compression ratio and speed, and a large number of supported archive formats. So you first unzip and then untar.
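
For example, both steps can be scripted with the Python standard library; a minimal sketch, assuming a downloaded archive named kddcup_data.zip (a hypothetical file name):

import tarfile
import zipfile

zip_path = "kddcup_data.zip"  # hypothetical name of the downloaded archive

# Step 1: unzip the downloaded archive.
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall("unzipped")
    inner_files = zf.namelist()

# Step 2: untar any .tar files that were inside the zip.
for name in inner_files:
    if name.endswith(".tar"):
        with tarfile.open("unzipped/" + name) as tf:
            tf.extractall("untarred")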

What I learned from this exercise to audit the KDD Cup datasets was as follows:

  • Some datasets are no longer available because the Web site has gone away or the competition has ended (e.g. Netflix Prize in 2007 and Yahoo Music in 2011);
  • Some datasets could not be readily read by Spotfire (there may be a clever way to convert them but I did not have the time to fully explore that);
  • The readable data sets are large ZIP files requiring time to download and unzip, but are compressed considerably in Spotfire;
  • The Spotfire files (8) I created consist of three tabs: Cover Page (see the data sets), Data Ecosystem (overview of the data set characteristics), and EDA (a few visualizations to see if there are relationships);
  • Most of the data sets do not contain column definitions so one has to see the limited documentation or guess; and
  • Five of the eight Spotfire files can be uploaded to this Wiki, but three are too large (>250 MB) and need to be uploaded elsewhere.

So now I better understand why Claudia Perlich drew a distinction between different types of data mining competitions, and why I, as a data scientist used to what I would call "government statistical" rather than "found" data, find these data sets harder to understand and use. At least I have made them easier for others to use by doing the data mining and preparation, so one can readily see the data in Spotfire and the data ecosystem in a spreadsheet.

Slides

Spotfire Dashboard

KDD Cup

Source: http://www.kdd.org/kddcup/index.php

KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miners.
Below are links to the descriptions of all past tasks.

KDD 2013: Determine whether an author has written a given paper

Description

The ability to search literature and collect/aggregate metrics around publications is a central tool for modern research. Both academic and industry researchers across hundreds of scientific disciplines, from astronomy to zoology, increasingly rely on search to understand what has been published and by whom.

Microsoft Academic Search is an open platform that provides a variety of metrics and experiences for the research community, in addition to literature search. It covers more than 50 million publications and over 19 million authors across a variety of domains, with updates added each week. One of the main challenges of providing this service is caused by author-name ambiguity. On one hand, there are many authors who publish under several variations of their own name.  On the other hand, different authors might share a similar or even the same name.

As a result, the profile of an author with an ambiguous name tends to contain noise, resulting in papers that are incorrectly assigned to him or her. This KDD Cup task challenges participants to determine which papers in an author profile were truly written by a given author.

Evaluation

The goal of this competition is to predict which papers were written by the given author.

The evaluation metric for this competition is Mean Average Precision.

An author profile contains an author and a set of papers that may or may not have been written by that author. The goal is to rank the papers in a list, so that the YES instances (the papers that have been written by the author) come before the NO instances. This ranking is evaluated using average precision. This means that we calculate the precision at each point when an actual YES instance occurs in the ranking, and then take the average of those values (i.e. the MAP, or Mean Average Precision).

Example 1
Suppose that an author profile contains 7 papers P1, P2, P3, P4, P5, P6, P7. Out of these, only paper P2 has been written by the author. Your task is to provide a ranked list of these 7 papers, with the papers that have been written by the author on top. Say that your system returns the ranked list P7 P3 P2 P4 P1 P6 P5 because it thinks that papers P7 and P3 have been written by the author, and the others have not. The AP in this case is (1/3)/1 = 0.33.

Example 2
Suppose that an author profile contains 5 papers P1, P2, P3, P4 and P5. Papers P3 and P5 have been written by the author, while papers P1, P2 and P4 have not. Say that your system returns the ranked list P3 P1 P4 P5 P2. In this case the AP is (1/1 + 2/4)/2 = 0.75. Note that the relative ordering of the NO instances does not have an effect on the score. Neither does the relative ordering of the YES instances. For instance, if your system would return the ranked list P5 P2 P1 P3 P4 the AP would still be (1/1 + 2/4)/2 = 0.75.

Example 3
Suppose that an author profile contains 10 papers P1, P2, P3, P4, P5, P6, P7, P8, P9 and P10. Papers P1, P3, P5, P6, P7 and P9 have been written by the author, while papers P2, P4, P8 and P10 have not. Say that your system returns the ranked list P1 P3 P5 P6 P7 P9 P2 P4 P8 P10 then the AP is (1/1+2/2+3/3+4/4+5/5+6/6)/6 = 1, i.e. the highest score possible. If your system would return the ranked list P5 P2 P1 P10 P7 P4 P3 P8 P9 P6 then the AP would be (1/1+2/3+3/5+4/7+5/9+6/10)/6 = 0.67.

The average of these average precisions over all author profiles is calculated and displayed as the score on the public leaderboard.
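
To make the metric concrete, here is a minimal Python sketch of average precision and MAP as described above (the function names are my own, not part of any competition code):

def average_precision(ranked_paper_ids, true_paper_ids):
    # Mean of the precision values at each rank where a YES paper appears.
    true_set = set(true_paper_ids)
    hits = 0
    precisions = []
    for rank, paper_id in enumerate(ranked_paper_ids, start=1):
        if paper_id in true_set:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(true_set) if true_set else 0.0

def mean_average_precision(ranked_lists, true_lists):
    # MAP: the average of the per-author average precisions.
    scores = [average_precision(r, t) for r, t in zip(ranked_lists, true_lists)]
    return sum(scores) / len(scores)

# Example 1 above: only P2 is a YES instance, ranked third, so AP = (1/3)/1 = 0.33.
print(average_precision(["P7", "P3", "P2", "P4", "P1", "P6", "P5"], ["P2"]))
# Example 2 above: P3 and P5 are YES instances, so AP = (1/1 + 2/4)/2 = 0.75.
print(average_precision(["P3", "P1", "P4", "P5", "P2"], ["P3", "P5"]))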

Submission File

The validation and test files will have author IDs and a list of candidate paper IDs. The final submission should rank only the paper IDs for a given author that were included in the original validation and test sets.

Example submission files can be downloaded from the data page. Submission files should contain two columns, AuthorId and PaperIds, with the header:

AuthorId,PaperIds
420,360546 24220 168137 424838 
2759,5738240 50667169 4347791
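
For illustration, a minimal Python sketch that writes this format; the predictions dictionary below is a hypothetical {author id: ranked paper ids} mapping, not real competition output:

import csv

# Hypothetical ranked predictions: {AuthorId: [PaperId, ...], ordered most to least likely}.
predictions = {
    420: [360546, 24220, 168137, 424838],
    2759: [5738240, 50667169, 4347791],
}

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["AuthorId", "PaperIds"])
    for author_id, ranked_papers in predictions.items():
        # PaperIds is a single space-delimited field, as in the example above.
        writer.writerow([author_id, " ".join(str(p) for p in ranked_papers)])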

Competition Rules

  • One account per participant

    You cannot sign up to Kaggle from multiple accounts and therefore you cannot submit from multiple accounts.

  • No private sharing outside teams

    Privately sharing code or data outside of teams is not permitted. It's OK to share code if made available to all players on the forums.

  • Team Mergers

    Team mergers are allowed and can be performed by the team leader. In order to merge, the combined team must have a total submission count less than or equal to the maximum allowed as of the merge date.

  • Team Limits

    There is no maximum team size.

  • Submission Limits

    You can submit a maximum of 5 entries per day.

    You can select up to 1 final submission for judging.

All training and scoring models must be frozen and submitted by June 19th, 2013. You must release your code under an Apache 2.0 license in order to be eligible for recognition and prize money. Once the test data is released, participants cannot update their models or scoring algorithms. The code submitted by June 19th, 2013 should be executable and should return the same results on the test data that the participant submits. Discrepancies between the final score submitted by a team and the score derived by running the submitted code may render the team disqualified from placing in the competition and receiving prizes.

Kaggle, Microsoft, and the KDD Cup Committee are not responsible for finding any discrepancies between the model participants submit and the models the winners used to make predictions on the test set. It is up to the community to self-regulate this and identify concerns in accordance with the competition timeline.

You may not use any external data for the purposes of this competition.

You agree to the additional competition rules.

Winners will need to sign the eligibility and release form in order to receive prize money.

Additionally, you agree to the Academic Search Data Terms of Use.

The Competition Organizers reserve the right to make any changes to the structure of this competition. All decisions made by the competition organizers are final.

Prizes

The total prize pool for this competition is $7,500.

The prizes will be distributed to the top four places as follows:

  • First Place - $2,500
  • Second Place - $1,750
  • Third Place - $1,250
  • Fourth Place - $1,000

The remaining $1,000 will be disbursed at the organizer's discretion to help students attend the conference. For a student to be eligible to receive this money, he or she needs to be ranked in 5th-10th place.

Frequently Asked Questions

How was the data released here sampled from the original data?

PaperAuthor is sampled uniformly at random from the original PaperAuthor dataset at Microsoft Academic Search. Note that this original dataset (or the sampled PaperAuthor thereafter) is subject to noise.

Additionally, we have access to ground truth (how ground truth is obtained is orthogonal to the problem) for a subset of paper-author pairs. All evaluation related files, such as Train.csv and Valid.csv, are created with uniform random sampling from this ground truth data.

Finally, PaperAuthor contains a superset of all paper-author pairs present in Valid.csv and Train.csv. Additional metadata about the papers and authors is provided in Author.csv, Paper.csv, Journal.csv, Conference.csv.

In PaperAuthor.csv, why are there paper author pairs with duplicate ids?

In MAS, data is collected from multiple sources. Due to this, it is possible to have slight to moderate variation in the metadata of the same paper-author pair. The released PaperAuthor dataset is subject to that noise, and therefore may have paper-author pairs with duplicate ids. Additionally, it is also possible to have multiple affiliations associated with the same paper-author pair. This may also give rise to duplicate ids for the same paper-author pairs.

Do the metadata files contain noise?

Yes, it is possible that Author.csv, Paper.csv, Journal.csv, Conference.csv contain noise.

Organizers

Microsoft Research
Vani Mandava

Vani Mandava is a Senior Program Manager with Microsoft Research in Redmond, with over 12 years of experience designing and shipping software projects and features that are in use by millions of users across the world. Her work in the Microsoft Research Connections team enables academic researchers and institutions to develop technologies that fuel data-intensive scientific research, using advanced techniques in data management and data mining. She currently leads data efforts for Microsoft Academic Search. Vani holds a Master’s degree in Computer Science with a focus on machine learning from the State University of New York, University at Buffalo. She has enabled the adoption of data mining best practices in various products across Microsoft client, server and services organizations, including MS-Office, SharePoint and Online Services (Ad Center). She co-authored the book ‘Developing Solutions with InfoPath’ and holds patents in service infrastructure design.

University of Washington
Senjuti Basu Roy, Ph.D.

Dr. Senjuti Basu Roy is an Assistant Professor at the Institute of Technology at the University of Washington Tacoma. She is also an active member of the Center for Web and Data Science at the University of Washington and leads research in data analytics at the center. She joined the Institute of Technology in January 2012. Prior to joining UW, she was a postdoctoral fellow at DIMACS at Rutgers University and was part of the Graph Mining and Knowledge Discovery project. Before that, Senjuti received her Ph.D. in Computer Science from University of Texas at Arlington in May 2011. Her past research experience includes working at Microsoft Research and IBM Research.

Senjuti’s primary research interests lie in the area of data and content management, with a focus on exploration, data analytics, and algorithms. Her research has been published in notable database conferences and journals, including SIGMOD, VLDB (conference and journal), ICDE, CIKM, and WWW. She has served on the program committees of WebDB, IIWeb, and GDM, and on the SIGMOD Travel Award Committee. Senjuti is an invited reviewer for journals including PVLDB, Information Systems, and TKDE.

Senjuti has designed and taught courses in database, data analytics and mining at the University of Washington Tacoma and is a graduate faculty at the University of Washington engaged in graduate supervision.

Swapna Savvana

Swapna Savvana is a Research Assistant at the Center for Web and Data Science at the University of Washington Tacoma. She is currently pursuing her MSCSS degree. Her major research interests include data mining and analysis, machine learning, and information retrieval. Her current research focuses on analysis and exploration of academic search data. She has been involved in building the user interface for the dietary data recording system for the Fred Hutchinson Cancer Research Center. She received her Bachelor of Technology degree in Computer Science in 2006. She worked with major retail banks, where she was involved in various development projects in the area of information management and business intelligence. Swapna is a qualified Java and SQL developer and has extensive knowledge of database systems and data mining.

Ghent University
Martine De Cock, Ph.D.

Dr. Martine De Cock is an associate professor at the Department of Applied Mathematics, Computer Science and Statistics at Ghent University. She holds an M.Sc. and a Ph.D. degree in Computer Science from this university. She worked as a visiting scholar at the University of California, Berkeley and at Stanford University, and as a visiting associate professor at the Institute of Technology of the University of Washington. Dr. De Cock currently leads the Computational Web Intelligence (CWI) team at Ghent University, where she is supervising several PhD students. She is co-author of 3 books and more than 130 peer-reviewed publications, among them more than 45 international journal articles. She is involved in the organization of major conferences on computational intelligence and web intelligence, and she is an editorial board member of various journals in the field. She is also a member of the Emergent Technology Technical Committee (ETTC) of the IEEE Computational Intelligence Society (CIS). She has taught several courses on Computational Intelligence and on Web Search and Online Advertising at Ghent University and at the University of Washington.

KDD Cup Chairs
Claudia Perlich, Ph.D.

Claudia Perlich serves as Chief Scientist at media6degrees and in this role designs, develops, analyzes and optimizes the machine learning that drives digital advertising to prospective customers of brands. An active industry speaker and frequent contributor to industry publications, Claudia enjoys serving as a guide in the world of data. She was recently named winner of the Advertising Research Foundation's (ARF) Grand Innovation Award and was selected as a member of Crain's New York annual 40 Under 40 list. Additionally, she has been published in over 30 scientific journals and holds multiple patents in machine learning. She has won many data mining competitions, including the prestigious KDD Cup three times, for her work on movie ratings in 2007, breast cancer detection in 2008, and churn and propensity predictions for telecom customers in 2009, as well as the KDD best paper award for data leakage in 2011 and bid optimization in 2012. Prior to joining m6d, she worked in Data Analytics Research at IBM's Watson Research Center, concentrating on data analytics and machine learning for complex real-world domains and applications. Claudia has a PhD in Information Systems from NYU and an MA in Computer Science from the University of Colorado. Claudia takes an active interest in the making of the next generation of data scientists and teaches "Data Mining for Business Intelligence" in the NYU Stern MBA program.

Brian Dalessandro

Brian is VP of Data Science at media6degrees, where he leads the development of m6d's patent pending machine learning technology. His current research interests include building autonomous machine learning systems over big data architectures, transfer learning, and influence attribution. This is Brian's second year as the co-chair of the 2012 KDD Cup competition. Prior to joining m6d, he was a Senior Research Analyst at Meetup.com, and a credit risk modeler for American Express. He holds an MBA with a concentration in Statistics from NYU and a BS in Mathematics and French Literature from Rutgers.

Ben Hamner

Ben Hamner is responsible for data analysis, machine learning, and competitions at Kaggle. He has worked with machine learning problems in a variety of different domains, including natural language processing, computer vision, web classification, and neuroscience. Prior to joining Kaggle, he applied machine learning to improve brain-computer interfaces as a Whitaker Fellow at the École Polytechnique Fédérale de Lausanne in Lausanne, Switzerland. He graduated with a BSE in Biomedical Engineering, Electrical Engineering, and Math from Duke University.

William Cukierski, Ph.D.

William Cukierski is a data scientist at Kaggle. He has a bachelor’s degree in physics from Cornell University and a Ph.D. in biomedical engineering from Rutgers University, where he studied applications of machine learning to cancer research. As a former Kaggle participant, he finished competitively in predictive data competitions on topics ranging from predicting stock movements, to forecasting grocery shopping, to automated essay grading.

Preliminary Winners

Congratulations to the preliminary winners on this competition!

1st Place: Team Algorithm

2nd Place: Team Dmitry&Leustagos&BS Man

3rd Place: Team mb74

4th Place: Team n_m

Timeline

Competition Timeframe

  • Thursday, April 18, 2013 - Competition begins. Training and validation data released.
  • Wednesday, June 12, 2013 - Deadline to enter contest and make a submission on the valid set. Public leaderboard frozen.
  • Thursday, June 13, 2013 - Valid set solutions released. Participants may optionally retrain their model on the combined training + valid sets.
  • Wednesday, June 19, 2013 - Model submission deadline. Submitted models should follow the model submission best practices as closely as possible.
  • Thursday, June 20, 2013 - Test data released
  • Wednesday, June 26, 2013 - Deadline to submit test set predictions

Winning Code Verification

  • Thursday, June 27, 2013 - Preliminary winners notified
  • Friday, June 28, 2013 -  Deadline for preliminary winners to open source models. As specified in the competition rules, you must release your code under an Apache 2.0 license in order to be eligible for prize money.
  • Tuesday, July 9, 2013 - Winners finalized.

Conference Preparation

  • Updated: Tuesday, July 23, 2013 - Deadline for top teams to inform the organizers whether they will be attending the KDD Workshop
  • Updated: Tuesday, July 23, 2013 - Deadline for teams to submit KDD Workshop papers

KDD Cup Workshop

  • August 11-14 2013, Precise Day and Time TBD - KDD Cup Workshop. This will include presentations from Microsoft Research, the winning teams, and a Kaggle or KDD Cup representative.

All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. All data releases will come within 24 hours of the specified day. The KDD Cup committee and organizers reserve the right to update the contest timeline if they deem it necessary as the contest progresses.

Data Files

File Name | Available Formats
basicPythonBenchmark | .csv (673.13 kb)
randomBenchmark | .csv (673.13 kb)
basicCoauthorBenchmark | .csv (673.13 kb)
dataRev2 | .postgres (308.07 mb)
dataRev2 | .zip (312.86 mb)
ValidSolution | .csv (384.21 kb)
Test | .csv (1013.49 kb)
TestPaper | .csv (2.07 mb)
basicCoauthorBenchmarkRev2Test | .csv (1013.49 kb)

Code to create the benchmarks is available on GitHub.

The files are available in two formats: as a zip archive containing CSV files (data.zip), and as a PostgreSQL relational database backup (data.postgres) that can be restored to an empty database. Instructions to restore the PostgreSQL database are on GitHub.

The dataset(s) for the challenge are provided by Microsoft Corporation and come from their Microsoft Academic Search (MAS) database. MAS is a free academic search engine that was developed by Microsoft Research, and covers more than 50 million publications and over 19 million authors across a variety of domains.

  • Author: is a publication author in the Academic Search dataset.
  • Paper: is a scholarly contribution written by one or more authors - could be of type conference or journal. Each paper also has additional metadata, such as year of publication, venue, keywords, etc.
  • Affiliation: the name of an organization with which an author can be affiliated. 

Dataset Descriptions:

The provided datasets are based on a snapshot taken in Jan 2013 and contain:


An Author dataset (Author.csv) with profile information about 250K authors, such as author name and affiliation. The same author can appear more than once in this dataset, for instance because he/she publishes under different versions of his/her name, such as J. Doe, Jane Doe, and J. A. Doe.

Name | Data Type | Comments
Id | Int | Id of the author
Name | Nvarchar | Author Name
Affiliation | Nvarchar | Organization name with which the author is affiliated


Paper dataset (Paper.csv) with data about 2.5M papers, such as paper title, conference/journal information, and keywords. The same paper may have been obtained through different data sources and hence have multiple copies in the dataset.

Name | Data Type | Comments
Id | Int | Id of the paper
Title | Nvarchar | Title of the paper
Year | Int | Year of the paper
ConferenceId | Int | Conference Id in which paper was published
JournalId | Int | Journal Id in which paper was published
Keywords | Nvarchar | Keywords of the paper


A corresponding Paper-Author dataset (PaperAuthor.csv) with (paper ID, author ID) pairs. The Paper-Author dataset is noisy, containing possibly incorrect paper-author assignments that are due to author name ambiguity and variations of author names.

Name | Data Type | Comments
PaperId | Int | Paper Id
AuthorId | Int | Author Id
Name | Nvarchar | Author Name (as written on paper)
Affiliation | Nvarchar | Author Affiliation (as written on paper)


Since each paper is either a conference or a journal, additional metadata about conferences and journals is provided where available (Conference.csv, Journal.csv).

Name | Data Type | Comments
Id | Int | Conference Id or Journal Id
ShortName | Nvarchar | Short name
Fullname | Nvarchar | Full name
Homepage | Nvarchar | Homepage URL of conf/journal


Co-authorship can be derived from the Paper-Author dataset.
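
For example, a co-author list can be obtained with a self-join on PaperId; a minimal pandas sketch, assuming PaperAuthor.csv has a header row with the column names listed in the table above (the file is large, so in practice you may want to restrict it to the authors of interest first):

import pandas as pd

# Assumes PaperAuthor.csv with the PaperId and AuthorId columns described above.
paper_author = pd.read_csv("PaperAuthor.csv", usecols=["PaperId", "AuthorId"]).drop_duplicates()

# Self-join on PaperId: two authors are co-authors if they share at least one paper.
pairs = paper_author.merge(paper_author, on="PaperId", suffixes=("_a", "_b"))
pairs = pairs[pairs["AuthorId_a"] < pairs["AuthorId_b"]]

# Count how many papers each author pair shares.
coauthor_counts = (
    pairs.groupby(["AuthorId_a", "AuthorId_b"])
    .size()
    .reset_index(name="SharedPapers")
)
print(coauthor_counts.head())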

Papers that authors have "confirmed" (acknowledging they were the author) or deleted (meaning they were not the author) have been split into Train, Validation, and Test sets based on the author's Id. The Train.csv and Valid.csv sets are provided now, and the Test.csv set will be released later in the competition.

Name | Data Type | Comments
AuthorId | Int | Id of the author
DeletedPaperIds | Nvarchar | Space-delimited set of deleted papers (Train only)
ConfirmedPaperIds | Nvarchar | Confirmed papers (Train only)
PaperIds | Nvarchar | PaperIds to rank from most likely to be deleted to least likely (Valid and Test only)

KDD Cup 2012: Predict which users (or information sources) one user might follow in Tencent Weibo

Background

Online social networking services have become tremendously popular in recent years, with popular social networking sites like Facebook, Twitter, and Tencent Weibo adding thousands of enthusiastic new users each day to their existing billions of actively engaged users. Since its launch in April 2010, Tencent Weibo, one of the largest micro-blogging websites in China, has become a major platform for building friendships and sharing interests online. Currently, there are more than 200 million registered users on Tencent Weibo, generating over 40 million messages each day. This scale benefits Tencent Weibo users, but it can also flood users with huge volumes of information and hence puts them at risk of information overload. Reducing this risk is a priority for improving the user experience, and it also presents opportunities for novel data mining solutions. Thus, capturing users’ interests and accordingly serving them with potentially interesting items (e.g., news, games, advertisements, products) is a fundamental and crucial feature of social networking websites like Tencent Weibo.

Task 1 Description

The prediction task involves predicting whether or not a user will follow an item that has been recommended to the user. Items can be persons, organizations, or groups and will be defined more thoroughly below. 

Data Sets

First, we define some notations as follows:

“Item”: An item is a specific user in Tencent Weibo, which can be a person, an organization, or a group, that was selected and recommended to other users. Typically, celebrities, famous organizations, or well-known groups were selected to form the ‘items set’ for recommendation. The dataset contains about 6K items.

Items are organized in categories; each category belongs to another category, and all together they form a hierarchy. For example, an item, a vip user Dr. Kaifu LEE,

vip user: http://t.qq.com/kaifulee (wikipedia:http://en.wikipedia.org/wiki/Kai-Fu_Lee)

is represented as

  • science-and-technology.internet.mobile

Categories at different levels are separated by a dot ‘.’, and the category information about an item can help improve your model’s predictions. For example, if a user Peter follows kaifulee, he may be interested in other items in the category that kaifulee belongs to, and might also be interested in items in the parent of that category.

“Tweet”: a “tweet” is the action of a user posting a message to the microblog system, or the posted message itself. So when a user is “tweeting”, his/her followers will see the “tweet”.

“Retweet”: a user can repost a tweet and append some comments (or do nothing) to share it with more people (his/her followers).

“Comment”: a user can add some comments to a tweet. The contents of the comments will not be automatically pushed to his/her followers as with ‘tweeting’ or ‘retweeting’, but will appear in the ‘comment history’ of the commented tweet.

“Followee/follower”: If User B is followed by User A, B is a followee of A, and A is a follower of B.

We describe the datasets as follows:

The dataset represents a sampled snapshot of Tencent Weibo users’ preferences for various items: the recommendations of items to users and the users’ ‘following’ history. It is larger in scale than other publicly available datasets released so far. It also provides richer information in multiple domains, such as user profiles, the social graph, and item categories, which will hopefully evoke thoughtful ideas and methodology.

The users in the dataset, numbering in the millions, come with rich information (demographics, profile keywords, follow history, etc.) for building a good prediction model. To protect the privacy of the users, the IDs of both the users and the recommended items are anonymized as random numbers, so that no identities are revealed. Furthermore, information in Chinese is encoded as random strings or numbers, so no contestant who understands Chinese gains an advantage. Timestamps of the recommendations are given for performing session analysis.

Two datasets in 7 text files, downloadable:

a) Training dataset : some fields are in the file rec_log_train.txt

b) Testing dataset: some fields are in the file rec_log_test.txt

Format of the above 2 files:

(UserId)\t(ItemId)\t(Result)\t(Unix-timestamp)

Result: values are 1 or -1, where 1 means that user UserId accepted the recommendation of item ItemId and followed it (i.e., added it to his/her social network), and -1 means that the user rejected the recommended item.

We provide the true values of the ‘Result’ field in rec_log_train.txt, whereas in rec_log_test.txt the true values of the ‘Result’ field are withheld (for simplicity, they are always 0 in the file). Another difference between rec_log_test.txt and rec_log_train.txt is that repeated recommended (UserId, ItemId) pairs were removed.

c) More fields about the users and the items, for both the training and the testing datasets, are in the following 5 files (a loading sketch follows this list):

i. User profile data: user_profile.txt

Each line contains the following information of a user: the year of birth, the gender, the number of tweets and the tag-Ids. It is important to note that information about the users to be recommended is also in this file.

Format:

(UserId)\t(Year-of-birth)\t(Gender)\t(Number-of-tweet)\t(Tag-Ids)

Year of birth is selected by user when he/she registered.

Gender has an integer value of 0, 1, or 2, which represents “unknown”, “male”, or “female”, respectively.

Number-of-tweet is an integer that represents the amount of tweets the user has posted.

Tags are selected by users to represent their interests. If a user likes mountain climbing and swimming, he/she may select "mountain climbing" or "swimming" as a tag. Some users select nothing. The original tags in natural language are not used here; each unique tag is encoded as a unique integer.

Tag-Ids are in the form “tag-id1;tag-id2;...;tag-idN”. If a user doesn’t have tags, Tag-Ids will be "0".

ii. Item data: item.txt

Each line contains the following information of an item: its category and keywords.

Format:

(ItemId)\t(Item-Category)\t(Item-Keyword)

Item-Category is a string “a.b.c.d”, where the categories in the hierarchy are delimited by the character “.”, ordered in top-down fashion (i.e., category ‘a’ is a parent category of ‘b’, category ‘b’ is a parent category of ‘c’, and so on).

Item-Keyword contains the keywords extracted from the corresponding Weibo profile of the person, organization, or group. The format is a string “id1;id2;…;idN”, where each unique keyword is encoded as an unique integer such that no real term is revealed.

iii. User action data: user_action.txt

The file user_action.txt contains the statistics about the ‘at’ (@) actions between the users in a certain number of recent days.

Format:

(UserId)\t(Action-Destination-UserId)\t(Number-of-at-action)\t(Number-of-retweet)\t(Number-of-comment)

If user A wants to notify another user about his/her tweet/retweet/comment, he/she would use an ‘at’ (@) action to notify the other user, such as ‘@tiger’ (here the user to be notified is ‘tiger’).

For example, if user A has “at”-ed user B 3 times, retweeted B 5 times, and commented on B 6 times, then there is one line “A   B     3     5     6” in user_action.txt.

iv. User sns data: user_sns.txt

The file user_sns.txt contains each user’s follow history (i.e., the history of following another user). Note that the following relationship can be reciprocal.

Format:

(Follower-userid)\t(Followee-userid)

v. User key word data: user_key_word.txt

The file user_key_word.txt contains the keywords extracted from the tweet/retweet/comment by each user.

Format:

(UserId)\t(Keywords)

Keywords is in the form “kw1:weight1;kw2:weight2;…;kwN:weightN”.

Keywords are extracted from the tweets/retweets/comments of a user, and can be used as features to better represent the user in your prediction model. The greater the weight, the more interested the user is in the keyword.

Every keyword is encoded as a unique integer, and the keywords of the users are from the same vocabulary as the Item-Keyword. 
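
As an illustration, a minimal pandas sketch of loading two of these tab-delimited files; the column names are labels I assign here (the files themselves have no header rows), and the large log file may need to be read in chunks in practice:

import pandas as pd

# Recommendation log: (UserId)\t(ItemId)\t(Result)\t(Unix-timestamp), no header row.
rec_log_train = pd.read_csv(
    "rec_log_train.txt", sep="\t", header=None,
    names=["UserId", "ItemId", "Result", "Timestamp"],
)

# User profiles: (UserId)\t(Year-of-birth)\t(Gender)\t(Number-of-tweet)\t(Tag-Ids).
user_profile = pd.read_csv(
    "user_profile.txt", sep="\t", header=None,
    names=["UserId", "YearOfBirth", "Gender", "NumTweets", "TagIds"],
)

# Tag-Ids is a ";"-separated string ("0" means the user selected no tags).
user_profile["TagIds"] = user_profile["TagIds"].astype(str).str.split(";")

# Join each recommendation record with the profile of the user who received it.
train = rec_log_train.merge(user_profile, on="UserId", how="left")
print(train.head())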

Evaluation

Teams’ scores and ranks on the leaderboard are based on a metric calculated from the predictions in the submitted result file and held-out ground truth. Until the last day of the competition (June 1, 2012), the leaderboard is computed on a validation dataset, a fixed set of instances sampled from the testing dataset at the beginning; on the last day, the scores and associated ranks on the leaderboard are based on the predictions for the rest of the testing dataset. This means that the top-3 ranked teams at the time the competition ends are the winners. The log used for forming the training dataset corresponds to an earlier time period than that of the testing dataset.

The evaluation metric is average precision. For a detailed definition of the metric, please refer to the tab ‘Evaluation’. 

Prizes

The prizes for the 1st, 2nd and 3rd winners for task 1 are US Dollars $5000, $2000, and $1000, respectively.

Data Files

File Name | Available Formats
track1 | .7z (460.24 mb), .zip (661.77 mb)
rec_log_test.txt | .7z (139.51 mb), .zip (185.45 mb)
example_submission_2column_header.csv | .zip (9.41 mb)
example_submission_1column_noheader.csv | .zip (5.96 mb)
KDD Cup Track 1 Data | .torrent (12.80 kb)
KDD_Track1_solution | .csv (45.22 mb)

You only need to download one format of each file.
Each has the same contents but uses a different packaging method.

track1.7z and track1.zip contain the same files - you only need to download one format
 
track1.7z contains the following compressed files:
 
Filename                      Available Format
--------------------------------------------------------------
rec_log_train                 .txt (1.99Gb)
user_profile                  .txt (55.8Mb)
item                          .txt (1.18Mb)
user_action                   .txt (217Mb)
user_sns                      .txt (740Mb)
user_key_word                 .txt (182Mb)

KDD Cup 2012: Predict the click-through rate of ads given the query and user information

Task 2 Description

Search advertising has been one of the major revenue sources of the Internet industry for years. A key technology behind search advertising is to predict the click-through rate (pCTR) of ads, as the economic model behind search advertising requires pCTR values to rank ads and to price clicks. In this task, given the training instances derived from session logs of the Tencent proprietary search engine, soso.com, participants are expected to accurately predict the pCTR of ads in the testing instances.

Training Data File

The training data file is a text file, where each line is a training instance derived from search session log messages. To understand the training data, let us begin with a description of search sessions.   

A search session refers to an interaction between a user and the search engine. It contains the following ingredients: the user, the query issued by the user, some ads returned by the search engine and thus impressed (displayed) to the user, and zero or more ads that were clicked by the user. For clarity, we introduce some terminology here. The number of ads impressed in a session is known as the ‘depth’. The order of an ad in the impression list is known as the ‘position’ of that ad. An ad, when impressed, is displayed as a short text known as the ‘title’, followed by a slightly longer text known as the ‘description’, and a URL (usually shortened to save screen space) known as the ‘display URL’.

We divide each session into multiple instances, where each instance describes an impressed ad under a certain setting  (i.e., with certain depth and position values).  We aggregate instances with the same user id, ad id, query, and setting in order to reduce the dataset size. Therefore, schematically, each instance contains at least the following information:

  • UserID
  • AdID
  • Query 
  • Depth 
  • Position 
  • Impression: the number of search sessions in which the ad (AdID) was impressed (displayed) to the user (UserID) who issued the query (Query).
  • Click: the number of times, among the above impressions, the user (UserID) clicked the ad (AdID).

Moreover, the training, validation and testing data contain more information than the above list, because each ad and each user have some additional properties. We include some of these properties in the training, validation and testing instances, and put other properties in separate data files that can be indexed using the ids in the instances. For more information about these data files, please refer to the section ADDITIONAL DATA FILES.

Finally, after including additional features, each training instance is a line consisting of fields delimited by the TAB character: 

1. Click: as described in the above list. 

2. Impression: as described in the above list. 

3. DisplayURL: a property of the ad. 

The URL is shown together with the title and description of an ad. It is usually the shortened landing page URL of the ad, but not always. In the data file,  this URL is hashed for anonymity. 

4. AdID: as described in the above list. 

5. AdvertiserID: a property of the ad. 

Some advertisers consistently optimize their ads, so the title and description of their ads are more attractive than those of others’ ads. 

6. Depth: a property of the session, as described above.   

7. Position: a property of an ad in a session, as described above. 

8. QueryID:  id of the query. 

This id is a zero‐based integer value. It is the key of the data file 'queryid_tokensid.txt'.

9. KeywordID: a property of ads. 

This is the key of 'purchasedkeywordid_tokensid.txt'.

10. TitleID: a property of ads. 

This is the key of 'titleid_tokensid.txt'. 

11. DescriptionID: a property of ads. 

 This is the key of 'descriptionid_tokensid.txt'. 

12. UserID

This is the key of 'userid_profile.txt'.  When we cannot identify the user, this field has a special value of 0.
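
Putting the twelve fields together, a single training line can be split and the empirical CTR of the instance computed as follows (the field names here are informal labels chosen for illustration, not part of the file):

    # Minimal sketch: parse one tab-delimited Track 2 training line (field order as above).
    FIELDS = ["Click", "Impression", "DisplayURL", "AdID", "AdvertiserID", "Depth",
              "Position", "QueryID", "KeywordID", "TitleID", "DescriptionID", "UserID"]

    def parse_training_line(line):
        values = line.rstrip("\n").split("\t")
        record = dict(zip(FIELDS, values))
        record["Click"] = int(record["Click"])
        record["Impression"] = int(record["Impression"])
        record["CTR"] = record["Click"] / record["Impression"]   # empirical click-through rate
        return record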

Additional Data Files

There are five additional data files, as mentioned in the above section: 

1. queryid_tokensid.txt 

2. purchasedkeywordid_tokensid.txt 

3. titleid_tokensid.txt 

4. descriptionid_tokensid.txt 

5. userid_profile.txt 

Each line of the first four files maps an id to a list of tokens, corresponding to the query, keyword, ad title, and ad description, respectively. In each line, a TAB character separates the id and the token set.  A token can basically be a word in a natural language. For anonymity, each token is represented by its hash value.  Tokens are delimited by the character ‘|’. 

Each line of ‘userid_profile.txt’ is composed of UserID, Gender, and Age, delimited by the TAB character. Note that not every UserID in the training and the testing set will be present in ‘userid_profile.txt’. Each field is described below: 

1. Gender: 

'1'  for male, '2' for female,  and '0'  for unknown. 

2. Age: 

'1'  for (0, 12],  '2' for (12, 18], '3' for (18, 24], '4'  for  (24, 30], '5' for (30,  40], and '6' for greater than 40. 
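
The token files and the user profile file can be loaded with two small helpers along these lines (function names are illustrative):

    # Minimal sketch: id -> list of hashed tokens for the first four additional files,
    # and user id -> {gender, age} codes for userid_profile.txt (codes as described above).
    def load_tokens(path):
        mapping = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                key, tokens = line.rstrip("\n").split("\t")
                mapping[key] = tokens.split("|")
        return mapping

    def load_user_profiles(path="userid_profile.txt"):
        profiles = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                user_id, gender, age = line.rstrip("\n").split("\t")
                profiles[user_id] = {"gender": int(gender), "age": int(age)}
        return profiles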

Testing Data Set

The testing dataset shares the same format as the training dataset, except that it omits the counts of ad impressions and ad clicks that are needed for computing the empirical CTR. A subset of the testing dataset is used to consistently rank submitted/updated results on the leaderboard. The testing dataset is used for picking the final winners.

The log used to form the training dataset covers an earlier time period than the log used for the testing dataset.

Evaluation

Teams are expected to submit their results as a text file in which each line corresponds, in the same order, to a line in the downloaded test file and contains a single field: the predicted CTR. During the competition, except on the last day, the lines corresponding to the validation subset are used to score submissions for the leaderboard ranking; on the last day (June 1, 2012), the lines corresponding to the testing subset are used for the leaderboard ranking and for picking the final winners.

The performance of the prediction will be scored in terms of the AUC (for more details about AUC, please see ‘ROC graphs: Notes and practical considerations for researchers’ by Tom Fawcett). For a detailed definition of the metric, please refer to the tab ‘Evaluation’.
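
The official scoring is done by the provided scoreKDD.py. Purely as an illustration of the metric, a rank-sum AUC over binary labels can be computed as below (here each instance is simplified to a single 0/1 label, whereas the official scoring works from the click and impression counts):

    # Illustrative AUC sketch (Mann-Whitney rank-sum formulation); not the official scorer.
    def auc(labels, scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i])   # ascending by score
        ranks = [0.0] * len(scores)
        i = 0
        while i < len(order):                                          # average ranks over ties
            j = i
            while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
                j += 1
            for k in range(i, j + 1):
                ranks[order[k]] = (i + j) / 2.0 + 1.0
            i = j + 1
        pos = sum(labels)
        neg = len(labels) - pos
        rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
        return (rank_sum - pos * (pos + 1) / 2.0) / (pos * neg)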

Prizes

Teams with the best performance scores will be the winners. The prizes for the 1st, 2nd, and 3rd place winners of Task 2 are US $5,000, $2,000, and $1,000, respectively.

Data Files

File Name                    Available Formats
--------------------------------------------------------------
user_benchmark.csv           .gz (4.03 mb)
advertiser_benchmark.csv     .gz (32.41 mb)
ad_benchmark.csv             .gz (41.07 mb)
keyword_benchmark.csv        .gz (40.07 mb)
mean_benchmark.csv           .gz (384.53 kb)
track2                       .7z (1.86 gb), .zip (2.68 gb)
basic_id_benchmark           .py (2.35 kb)
KDD_Track2_solution          .csv (243.64 mb)
test_score_kdd               .py (1.47 kb)
title_benchmark.csv          .gz (38.21 mb)
description_benchmark.csv    .gz (37.58 mb)
scoreKDD                     .py (7.06 kb)
query_benchmark.csv          .gz (28.70 mb)
test                         .7z (202.83 mb), .zip (356.25 mb)

You only need to download one format of each file; every format has the same contents but uses a different packaging method. In particular, track2.7z and track2.zip contain the same files.
 
track2.7z contains the following compressed files:
 
Filename                         Available Format
--------------------------------------------------------------
training                         .txt (9.9Gb)
queryid_tokensid                 .txt (704Mb)
purchasedkeywordid_tokensid      .txt (26Mb)
titleid_tokensid                 .txt (172Mb)
descriptionid_tokensid           .txt (268Mb)
userid_profile                   .txt (284Mb)
 

KDD Cup 2011: The Yahoo! Music Dataset and KDD-Cup'11

 
Gideon Dror
Yahoo! Labs
 
Noam Koenigstein
School of Electrical Engineering
Tel Aviv University
 
Yehuda Koren
Yahoo! Labs
 
Markus Weimer
Yahoo! Labs
 
Editor: G. Dror, Y. Koren and M. Weimer

Abstract

KDD-Cup 2011 challenged the community to identify user tastes in music by leveraging Yahoo! Music user ratings. The competition hosted two tracks, which were based on two datasets sampled from the raw data, including hundreds of millions of ratings. The underlying ratings were given to four types of musical items: tracks, albums, artists, and genres, forming a four level hierarchical taxonomy.
 
The challenge started on March 15, 2011 and ended on June 30, 2011, attracting 2389 participants, 2100 of whom were active by the end of the competition. The popularity of the challenge is related to the fact that learning a large-scale recommender system is a generic problem, highly relevant to industry. In addition, the contest drew interest by introducing a number of scientific and technical challenges, including dataset size, the hierarchical structure of items, high resolution timestamps of ratings, and a non-conventional ranking-based task.
 
This paper provides the organizers' account of the contest, including: a detailed analysis of the datasets, discussion of the contest goals and actual conduct, and lessons learned throughout the contest.

1. Introduction

People have been fascinated by music since the dawn of humanity. A wide variety of music genres and styles has evolved, reflecting diversity in personalities, cultures and age groups. It comes as no surprise that human tastes in music are remarkably diverse, as nicely exhibited by the famous quotation: "We don't like their sound, and guitar music is on the way out" (Decca Recording Co. rejecting the Beatles, 1962).
 
Yahoo! Music has amassed billions of user ratings for musical pieces. When properly analyzed, the raw ratings encode information on how songs are grouped, which hidden patterns link various albums, which artists complement each other, how the popularity of songs, albums and artists varies over time and, above all, which songs users would like to listen to. Such an analysis introduces new scientific challenges. We have created a large scale music dataset and challenged the research world to model it through the KDD Cup 2011 contest [1]. The contest released over 250 million ratings performed by over 1 million anonymized users. The ratings are given to different types of items: tracks, albums, artists, genres, all tied together within a known taxonomy. Thousands of teams participated in the contest, trying to crack the unique properties of the dataset.

2. The Yahoo! Music Dataset

2.1. Yahoo! Music Radio service

Yahoo! Music [2] was one of the first providers of personalized internet music radio stations, with a database of hundreds of thousands of songs. As a pioneer in online music streaming, it influenced many subsequent services. Yahoo! Music used to be the top ranked online music site in terms of audience reach and total time spent. The service was free, with some commercial advertising in between songs that could be removed by upgrading to a premium account. Users could rate songs, artists, albums and even genres on a 5-star system, or using a slider interface. These ratings were used by Yahoo! Music to generate recommendations that matched the user's taste, based on either the taxonomy of items or on recommendations of other users with similar musical tastes.

2.2. Ratings dataset

The KDD-Cup contest released two datasets based on Yahoo! Music ratings. The larger dataset was created for Track1 of the contest, and a smaller dataset was created for Track2. The two datasets share similar properties, while the dataset for Track2 omits dates and times and refers to a smaller user population. In addition, the distinctive nature of the tasks dictates using different kinds of test sets. This section presents a statistical analysis of the dataset of Track1, which is the richer and larger of the two. Since both datasets were sampled from the same source, both are characterized by many similar patterns. More details specific to compilation of the Track2 dataset are given in Sec. 3.

The Track1 dataset comprises 262,810,175 ratings of 624,961 music items by 1,000,990 users collected during 1999-2010. The ratings include one-minute resolution timestamps, allowing refined temporal analysis. Each item and each user have at least 20 ratings in the whole dataset. The available ratings were split into train, validation and test sets, such that the last 6 ratings of each user were placed in the test set and the preceding 4 ratings were used in the validation set. The train set consists of all earlier ratings (at least 10). The total sizes of the train, validation and test sets are therefore 252,800,275, 4,003,960, and 6,005,940 ratings, respectively.


Fig. 1 depicts the distribution of the number of ratings per item and the number of ratings per user. Both distributions exhibit clear power-law characteristics, with a long tail of "popular" items and very active users, and a very large set of items and users associated with very few ratings. Indeed, the sparsity factor of the dataset, 99.96%, is relatively high when compared to other collaborative filtering datasets; the Netflix dataset, for example, has a 98.82% sparsity factor (Bennett and Lanning, 2007).

[1] http://kddcup.yahoo.com
[2] http://music.yahoo.com

Figure 1: Power-law distribution of the number of ratings per user and per item

KDDCup2011Figure1.png

Fig. 2 depicts a histogram of the ratings in the train and validation sets using a logarithmic vertical scale. The vast majority of the ratings are multiples of ten. This distribution results from the different user interfaces that generally encouraged users to provide round ratings.
Figure 2: Rating distribution in the training and validation datasets

KDDCup2011Figure2.png

Specifically, the popularity of a widget used to enter ratings on a 1-to-5 star scale is reflected by the dominance of the peaks at 0, 30, 50, 70 and 90, into which the star ratings were translated. Notice that the distributions of ratings of the train and validation sets differ significantly on lower ratings. This is caused by the fact that the validation set has exactly 4 ratings per user, thus significantly down-sampling the heavy raters who are responsible for many of the lower ratings, an effect that will be detailed shortly. The different sampling also results in a significantly higher mean rating in the validation set (68.23) and the test set (68.34) compared to the training set (48.87). The different statistical characteristics of the datasets were an impediment that some participants found difficult to handle.

 

Fig. 3(a) depicts the distribution of mean ratings of users and items. The locations of the modes (at 89 and 50, respectively) and the variances of the two distributions are quite distinct. In addition, the distribution of the mean user ratings is significantly more skewed. To understand this difference, we grouped users into sets characterized by user activity. Each set consisted of users whose number of rated items falls in a certain range, e.g., between 100 and 120 ratings. For each set of users we calculated their mean rating and the mean number of ratings. We followed the same procedure for items, to produce the graphs in Figure 3(b). The graph for users (red) reveals an important feature of the Yahoo! Music dataset: the more ratings a user has given, the lower her mean rating. Specifically, "heavy" users, those who rate tens of thousands of items, have mean ratings of around 30. One possible explanation of this effect is that users who rate more items tend to encounter and rate items that don't match their musical taste, and thus the rating scores tend to be lower. This pattern explains the difference between the distribution of ratings in the training and validation sets, pronounced at the lower ratings in Fig. 2. Unlike the training set, the validation and test sets weigh all users equally, and are therefore dominated by the many light raters whose average rating is considerably higher.
Figure 3: Item and user rating characteristics
KDDCup2011Figure3.png
 
With respect to items, the picture is slightly more complicated: rare items as well as the most popular items have a relatively high mean rating, while items with medium number of ratings tend to fall considerably below the average rating.
 
A unique characteristic of the Yahoo! Music dataset is that user ratings are given to entities of four different types: tracks, albums, artists, and genres. Moreover, all rated items are tied together within a taxonomy. That is, for a track we know the identity of its album, performing artist and associated genres. There is exactly one artist and one album associated with each track but possibly several genres. Similarly, we have artist and genres annotation for the albums. There is no genre information for artists, as artists may switch between many genres in their career. The same song that is performed by different artists on different albums is considered as a different track. Similarly, when two or more artists collaborate they receive a new artist label that is different than each of the original labels. We measured the correlation coefficient of ratings across the hierarchy and discovered strong ties between related items. The correlation coefficient of track ratings by the same artist is 0.418 and 0.494 for tracks on the same album. The correlation coefficient of track ratings and their album ratings is 0.411, and 0.416 for track ratings and their artist ratings. The correlation of album ratings with ratings of their artists is 0.467 and especially correlated are ratings of albums by the same artist with a correlation coefficient of 0.608. In Sec. 5 we discuss the benefits of incorporating taxonomy into predictive models in each track.
 
The majority of items (81.15%) are tracks, followed by albums (14.23%), artists (4.46%) and genres (0.16%). The ratings, however, are not uniformly distributed: only 46.85% of the ratings belong to tracks, followed by 28.84% to artists, 19.01% to albums and 5.3% to genres. Moreover, these proportions may vary significantly between users. Specifically, we found that heavier raters cover more of the numerous tracks, while the light raters mostly concentrate on artists. Interestingly, all four item types have very similar mean ratings: 49.17, 45.12, 51.04 and 47.83 for tracks, albums, artists and genres, respectively.
 
Several temporal phenomena affected the rating patterns of users. First, over the years the repository of items evolved considerably. Second, the service featured different rating user interfaces ("widgets") with different appearances, which changed over the years, encouraging different rating patterns. Third, social, economic, and even technological phenomena are likely to have a non-trivial effect on the popularity of items and on usage patterns of the service. Fig. 4(a) depicts the mean weekly rating value vs. the number of weeks that passed since the launch of the service. The dashed line represents the overall mean rating. We can see that the first 125 weeks are characterized by strong fluctuations in the mean rating. This is also related to a relatively low overall activity, with an average of about thirty thousand ratings per week. Later changes in the Yahoo! Music service resulted in a substantial increase in the number of ratings, and the mean rating values start to stabilize. Considerable changes in the website's interface occurred around the 125th and 225th weeks, which are expressed as noticeable changes in the mean rating in Fig. 4(a).
Figure 4: Temporal statistics. Time is measured in weeks since the first rating

KDDCup2011Figure4.png

 
Fig. 4(b) depicts the number of ratings per item type (track, album, artist, genre) vs. the number of weeks that passed since the launch of the service in 1999. Noticeable changes in the distribution of types, correlated with substantial changes in the mean rating, can be observed around the 125th week, when the rating activity for artists increases by an order of magnitude, and around the 225th week, when the rating activity for both albums and artists increases significantly.
 
The contest offered two different tasks. Track1 calls for predicting users' ratings of musical items. Items can be tracks, albums, artists and genres. Each user and item has at least 20 ratings in the dataset (train, validation and test sets combined). A detailed description of the provided dataset was given in Sec. 2. The main dataset statistics are described in Table 1. Each user has at least 10 ratings in the training data. Then, each user has exactly four ratings in the validation data, which come later in time than the ratings by the same user in the training data. Finally, the test data holds the last 6 ratings of each user. The contestants were asked to provide predictions for these test scores. The evaluation criterion is the Root Mean Squared Error (RMSE) between predicted ratings and true ones. Formally:
 
\mathrm{RMSE} = \sqrt{\frac{1}{|T_1|} \sum_{(u,i) \in T_1} \left( r_{ui} - \hat{r}_{ui} \right)^2}

where (u, i) \in T_1 are all the user-item pairs in the Track1 test data set, r_{ui} is the actual rating of user u for item i, and \hat{r}_{ui} is the predicted value for r_{ui}. This amounts to a traditional collaborative filtering rating prediction task.
Table 1: Datasets size statistics
Task     #Users      #Items    Train         Validation   Test
Track1   1,000,990   624,961   252,800,275   4,003,960    6,005,940
Track2   249,012     296,111   61,944,406    N/A          607,032
 
Track2 involves a less conventional task. A similar music rating dataset was compiled, covering four times fewer users than Track1. Once again, each user and item has at least 20 ratings in the dataset (train and test sets combined); see Table 1 for the main dataset statistics. The train set includes ratings (scores between 0 and 100) of tracks, albums, artists and genres performed by Yahoo! Music users.
 
For each user participating in the Track2 test set, six items are listed. All these items must be tracks (not albums, artists or genres). Three of these six items have never been rated by the user, whereas the other three items were rated "highly" by the user, that is, scored 80 or higher. The three items rated highly by the user were chosen randomly from the user's highly rated items, without considering rating time. The three test items not rated by the user are picked at random with probability proportional to their odds of receiving "high" (80 or higher) ratings in the overall population. More specifically, for each user a rating is sampled uniformly from all track ratings which are 80 or higher. The track associated with the rating is selected if it has not been rated by the user and has not already been selected as an unrated item for the user. The sampling for each user terminates when three different tracks, disjoint from the tracks rated by the user, have been selected. Note that many users do not participate in this test set at all.
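
A hedged sketch of this sampling procedure, assuming a pool that contains one entry per track rating of 80 or higher in the overall population (names and the seed handling are illustrative, not the organizers' code):

    import random

    # Sketch of the Track2 negative sampling described above: draw tracks with probability
    # proportional to their share of "high" (>= 80) ratings, skipping tracks the user has
    # rated and tracks already chosen, until three negatives are found.
    def sample_negatives(high_rating_tracks, rated_by_user, n_negatives=3, seed=0):
        rng = random.Random(seed)
        chosen = []
        while len(chosen) < n_negatives:
            track = rng.choice(high_rating_tracks)   # uniform over ratings => proportional to popularity
            if track not in rated_by_user and track not in chosen:
                chosen.append(track)
        return chosen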
 
The goal of such a task would be differentiating high ratings from missing ones. Participants were asked to identify exactly three highly rated items for each user included in the test. The evaluation criterion is the error rate, which is the fraction of wrong predictions.
 
We had several objectives in designing the second track of the contest. Initially, we wanted to add a dataset of a lower scale, in order to appeal to competitors with less capable systems, who might get discouraged by the large size of the Track1 dataset. More importantly, the Track2 dataset provided us with an opportunity to experiment with a less established evaluation methodology, which may better relate to real life scenarios. The proposed metric is related to the common recommendation task of predicting the items that the user will like (rather than predicting the rating value of items). Furthermore, the task of Track2 requires extending the generalization power of the learning algorithm to the truly missing entries (items that were never rated by the user), as required in real life scenarios. Finally, the way we drew negative examples (in proportion to their dataset popularity) discourages known trivial solutions to the top-K recommendation task, where the most popular items are always suggested, regardless of the user's taste.
 
We decided to exclude timestamps (both dates and times) from the Track2 dataset. Our purpose was to center the focus of the time-limited competition on a specific aspect of the problem. In particular, we assume that there are many strong short-term correlations between various actions of a user, which would greatly help in separating rated from unrated items. For example, some users have the tendency to listen to multiple tracks of the same album in a row. We wanted to encourage solutions catering to the longer-term user taste, rather than analyzing short term trends, thus we hid the temporal information in Track2. At both tracks, the test set was internally split into two equal halves known as Test1 and Test2, in a way not disclosed to the competitors. Participants were allowed to submit their test set predictions every 8 hours, and receive feedback on their Test1 performance that was reflected in the public leader-board. The true rankings of competitors were based on Test2 scores, which were not disclosed during the competition.
 
Comparison to real world evaluation

Real world systems usually perform recommendations in two phases. The first phase is the modeling phase, in which a prediction model is trained to capture the patterns in the dataset. The model's quality is usually measured using an evaluation criterion such as RMSE, precision at K, or error rate, similar to the performance measure used in Track2. The second phase is the retrieval of recommendations, in which a retrieval algorithm is developed, sometimes with the help of marketing experts, which uses the model to determine which recommendations should be presented to the users at each time. The most naive retrieval algorithm ranks the items for each user by predicted rating and recommends her the top ranking items. However, business-wise it is usually better to diversify the set of recommended items and to control how often an item can be recommended. This comes at the cost of choosing items that are not ranked highest, but are still liked by the user. The overall quality of a real world system is usually evaluated by some Key Performance Indicator (KPI) such as the click-through rate or purchase rate of recommended items. This, of course, is very difficult to reconstruct in a competition framework.

4. Contest conduct

The contest took place over a period of three and a half months, from March 15 till June 30, 2011. The datasets were made available to contestants on the first day of the competition. In addition, contestants could register for the contest and access sample data to prepare their software two weeks in advance of the beginning of the competition. During the competition, each team was allowed to submit one solution every eight hours for each track of the competition. The submission system provided the contestants with immediate feedback on the performance of the submitted solution on the test set. The contest website also featured a public leader-board that showed an up-to-date ranking of the best submissions so far (based on scores on the Test1 set).
 
The registration system required users to form teams and to be part of only one team at a time. Teams could leave the competition by marking themselves as inactive. The contestants made heavy use of that functionality, but not to actually leave the contest. Instead, it was used mostly to regroup: Out of the 2389 users that registered for the contest, 2100 were still part of an active team at the end of the competition. During the competition, 4252 teams were registered (more than users!) of which 1287 were active at the end of the competition.
 
Each of the two contest tracks had about the same number of teams submitting to it: 1878 teams ever submitted to Track1, and 1854 to Track2. The same is not true for the number of submissions. During the competition, there were 6344 solutions submitted to Track1, while Track2 received "only" 5542 submissions. Table 2 shows a histogram of the number of submissions per track and how many teams submitted that many times. Fig. 5 shows the number of submissions per day and track. As can be clearly seen from this plot, the contestants were eager to (re-)submit their best solutions on the last day of the competition. The almost equal number of submissions and teams submitting is somewhat surprising, because the leadership of Track2 was much more volatile than that of Track1, as can be seen in Fig. 6(a) and 6(b). During the competition, the top spot on the public leader-board changed 52 times for Track1 and almost twice as often, 96 times, for Track2. Even more drastically, only 12 teams ever held the top spot in Track1 during the competition, while 35 teams held that position for some period of time in Track2. We believe that this greater volatility in the leadership of Track2 can be attributed to the novel task to be solved in that track of the competition.
Figure 5: Submissions per day
KDDCup2011Figure5.png
Figure 6: Leadership and Performance over time
KDDCup2011Figure6.png
 
According to the rules of the competition, the last submission of each team was used for determining the winners, not the best one. Thus, the question arises whether all teams managed to submit their best solution in time or whether an earlier submission would actually have won. Much to the relief of the teams involved, the order of the top spots in the leader-board would have been the same had we used the best submission. Similarly, for the top spots, the publicly reported rankings based on the Test1 set equal the rankings based on the undisclosed Test2 set, meaning that contestants did not strongly overfit the Test1 set. As becomes apparent from Figures 7(a) and 7(b), the final submissions from all teams participating till the end are also fairly spread out in terms of performance.
Table 2: Submissions statistics: Number of submissions per team
Submissions   Teams (Track1)   Teams (Track2)
0-9           1728             1738
10-19         80               65
20-29         36               29
30-39         13               2
40-49         8                7
50-59         4                6
60-69         0                1
70-79         4                1
80-89         1                1
90-99         2                2
100-109       0                1
110-119       1                0
120-129       1                1

Table 3 details the three winning teams on Track1 and on Track2, together with their performances on Test1 and Test2. We observe that the performance of all winners on both tracks is better on Test2 than on Test1.

Table 3: The winning teams and their performances on Test1 and Test2.
            Track1                                    Track2
Team         RMSE-Test1   RMSE-Test2     Team         Error-Test1   Error-Test2
NTU          21.0147      21.0004        NTU          2.4749        2.4291
Commendo     21.0815      21.0545        Lemon        2.4808        2.4409
InnerPeace   21.2634      21.2335        Commendo     2.4887        2.4442

 

NTU is an acronym for "National Taiwan University" and Lemon is shorthand for "The Art of Lemon". Full details on the winning teams can be found on the contest website.

5. Lessons

This JMLR volume includes 13 papers describing the techniques employed by top teams. Short versions of these papers were previously published in the KDD-Cup'11 workshop. These papers describe many interesting algorithms, which enhance different recommendation methods. We encourage the reader to look at these papers for more details. Here, we will highlight some of the higher-level lessons that we take away from the competitors' works.
 
The Track1 solutions followed many of the techniques that were found successful in the Netflix Prize competition (Bennett and Lanning, 2007), which targeted the same RMSE metric. In order to provide proper context, we briefly describe the related Netflix Prize setup. Netflix published a dataset including more than 100 million movie ratings performed by about 480,000 users on 17,770 movies. Each rating is an integer from 1 (worst) to 5 (best) and is associated with a date. The Netflix Prize challenge asked competitors to predict 2.8 million test ratings, with accuracy judged by RMSE. The winning solution achieved a test RMSE of 0.8556, concluding a massive three-year effort.
 
In a sense, the reported Track1 results reinforce many findings on the Netflix data, which still hold despite the usage of a rather different dataset. Solutions were created by blending multiple techniques, including nearest neighbor models, restricted Boltzmann machines and matrix factorization models. Among those, matrix factorization models proved to perform best. Modeling temporal changes in users' behavior and items' popularity was essential for improving solution quality. Among nearest neighbor models, item-item models outperformed user-user models. Such a phenomenon is notable because the number of items in our dataset is large and roughly equals the number of users; hence one could have expected that in such a setup user-user and item-item techniques would deliver similar performance. The given item taxonomy helped in accounting for the large number of sparsely rated items; however, its effect was relatively low in terms of RMSE reduction.
 
The best result achieved on Track1 was RMSE=21 (Chen et al., 2011a; Jahrer and Toscher, 2011), which was achieved by an ensemble blending many solutions. The best result by a single (non-blended) method was RMSE=22.12 (Chen et al., 2011b). This pretty much confirms our pre-contest expectations based on Netflix Prize experience. Note that Netflix Prize results need to be calibrated in order to account for the different rating scales. While in the Netflix Prize the ratings range (difference between the highest (=5) and lowest (=1) rating) is 4, in our dataset the ratings range is 100. Hence, if one linearly maps our ratings to the 1-5 star scale, the RMSE of the same methods will shrink by a factor of 25. This means that on the Netflix score range, the best score achieved at Track1 is equivalent to 0.84, which is strikingly close to the best score known on the Netflix dataset (0.8556). Comparing the fraction of explained variance (known as R^2) reveals more of a difference between Track1 and the Netflix Prize. The total variance of the ratings in the Netflix test set is 1.276, corresponding to an RMSE of 1.1296 by a constant predictor. Three years of multi-team concentrated efforts reduced the RMSE to 0.8556, thereby leaving the unexplained ratings variance at 0.732. Hence, the fraction of explained variance is R^2 = 42.6%, whereas the remaining 57.4% of the ratings variability is due to unmodeled effects (e.g., noise). Moving to the Yahoo! Music Track1 test set, the total variance of the test dataset is 1084.5 (reflecting the 0-100 rating scale). The best submitted solution could reduce this variance to around 441, yielding R^2 = 59.3%. Hence, with the Yahoo! Music dataset a considerably larger fraction of the variance is explainable by the models.
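
The explained-variance figures above follow from R^2 = 1 - (residual variance)/(total variance); spelled out:

    \text{Netflix: } R^2 = 1 - \frac{0.8556^2}{1.276} = 1 - \frac{0.732}{1.276} \approx 42.6\%

    \text{Track1: } R^2 = 1 - \frac{441}{1084.5} \approx 59.3\%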
 
Track2 of the competition offered a less traditional metric: the error rate of a classifier separating loved tracks from unrated ones. We were somewhat surprised by the low error rate achieved by the contestants, 2.47% for the best solution (McKenzie et al., 2011). This means that for over 97% of test tracks, the model could correctly infer whether they are among the loved ones. However, such a performance should be correctly interpreted in light of real life expectations. Our chosen metric contrasted the same number of positive ("loved items") and negative examples. In real life, one would argue that positives and negatives are not balanced, but rather that negatives grossly outnumber positives. This makes the task of identifying the "top-k" items harder.
 
In comparison to other ranking-oriented metrics, the employed error rate offers two benefits. First, it is fast to compute, without the need to rank positive items within a large set of background items. Second, it can be tailored to being neutral to popularity effects, which tend to greatly influence ranking-based metrics. This was achieved by sampling the negative examples by their overall popularity (i.e., popular items are sampled more frequently as negative examples), so that positive and negative examples come from the same distribution. This favors methods that go deeper than merely promoting the overall popular items. In other studies, we also experimented with a variant of the metric that picks negative examples uniformly, without over-emphasizing popular items. This tends to expose other aspects of performance, which shed additional light on the algorithms' properties.
 
A key technique shared by the Track2 solutions is sampling the missing ratings, to be used as negative examples for training the learning algorithm. This way, the proposed methods model not only the observed ratings, but also those ratings sampled from the missing ones. This greatly improved performance at Track2. This also confirms recent works (Cremonesi et al., 2010; Steck, 2010) showing that models that also account for the missing ratings provide significant improvements over even sophisticated models that are optimized on observed ratings only. We note that, as dictated by the balanced structure of the test set, competitors resorted to balanced sampling. That is, the number of sampled missing ratings equals the number of non-missing ratings for the same user. This also aligns well with practices used when learning unbalanced classification models, where training is often performed on a balanced number of positive and negative examples. Yet, we would like to explore what would have happened had the objective function been different, such as the recall@K metric, which does not utilize an equal number of positive and negative examples. Would we still want to construct a balanced training set with an equal number of positive and negative examples per user, or would we rather use the full set of missing values as in (Cremonesi et al., 2010; Hu et al., 2008; Steck, 2010)?
 
The Track2 dataset did not expose timestamps, for reasons explained in Sec. 3. However, in real life, where such timestamps are available and considered highly beneficial (see, e.g., Lai et al. (2011)), the way negative examples are sampled becomes non-trivial and raises the question of how one would attach a timestamp to the sampled negative examples. Given the importance of sampling methods in implicit-feedback recommenders, we believe that this is an important question deserving further research.
 
Another theme shared by many Track2 solutions is the usage of pair-wise ranking objective functions. Such schemes do not strive to recover the right score for each item, but rather maintain the correct ranking for each pair of items. Pair-wise ranking objective seems more natural and elegant in our ranking-oriented setup, where item ordering matters, not individual scores. A notable recommendation algorithm of this family is BPR (Rendle et al., 2009), which was indeed employed by several teams. In fact, the best reported result by a single (non-blended) method (error rate of 2.911%) is by an extension of BPR (Mnih, 2011). Yet, in future research we would like to further experiment with the value of pair-wise ranking solutions, compared to the arguably cheaper regression-based solutions, which try to recover each single user-item score (Hu et al., 2008; Steck, 2010). In fact, the top two leading teams on Track2 (Lai et al., 2011; McKenzie et al., 2011) utilized regression techniques which minimize squared loss of recovered user-item scores. Their single-method results were quite close to those reported by pair-wise ranking techniques.
 
Comparing the results of Track1 and Track2 reinforces the known observation that achievable RMSE values on the test set lie in a quite compressed range. This was also evident in the Netflix contest, where major modeling improvements could only slightly reduce RMSE. This observation becomes much more striking when considering comparable improvements on the error rate metric applied at Track2. Let us analyze the influence of two components that improve modeling accuracy.
 
1. Dimensionality of the matrix factorization model. On the RMSE front, it was observed that after reaching 50-100 factors, there is barely any improvement when increasing the number of factors; see, e.g., Figure 7 of Dror et al. (2011). However, the case is very different with Track2, where the error rate metric was used. Competitors resorted to very high dimensionalities of matrix factorization models in order to improve performance. For example, McKenzie et al. (2011) (Table 2) report the error rate dropping from 6.56% to 4.25% when increasing dimensionality from 800-D to 3200-D.
 
2. Utilizing the given item taxonomy. We have observed a quite subtle RMSE reduction, from around 22.85 to 22.6, by utilizing the taxonomy; see Dror et al. (2011). A similar RMSE reduction is also reported in Table 3 of Chen et al. (2011b). When moving to the error rate metric, utilization of the taxonomy results in much more significant improvements. Lai et al. (2011) report a matrix factorization model (BinarySVD) achieving around 6% error rate, which is significantly reduced to 3.49% by considering the item taxonomy (BinarySVD+). Similarly, Mnih (2011) (Table 3) reports reducing the error rate from 5.673% to 3.314% by accounting for taxonomy in a latent factor model.
 
In conclusion, we have found consistent evidence that the same modeling improvements that only modestly impact RMSE have a substantial effect on the error rate metric. Similar evidence was also hinted at elsewhere (Koren, 2008). Given the large popularity of the RMSE metric in recommender-systems research, it is important to interpret RMSE improvements well. Clearly, small RMSE changes can correspond to significantly different rankings of items. For example, RMSE-minimizing algorithms tend to diminish variance by concentrating all predictions close to averages, with small deviations from the average dictating item rankings. We believe that the impact of RMSE reduction on user experience is a topic deserving further exploration.
Figure 7: Histogram of the final submissions' performance

KDDCup2011Figure7.png

5.1. Maneuvers by the contestants
While the contest was clearly decided by algorithmic tools, we would like to highlight some interesting strategies used by the competitors that were tailored to the specific setup of the contest.
 
Recall that the test set was split into two halves, Test1 and Test2, where Test1 results were exposed on the leader-board and Test2 results were used for determining the winners and remained undisclosed during the contest. Contestants used the reported Test1 results in order to tweak their solutions. This was particularly doable for Track1, which used the RMSE metric. It can be shown that once the RMSE of individual predictors is observed, their optimal linear combination can be derived. Hence, contestants created a linear combination of individual submissions based on their leaderboard-reported Test1 RMSE values. Interestingly, Balakrishnan et al. (2011) report using a variant of this technique even for the error-rate metric employed at Track2. In essence, the Test1 set partially served as an additional validation set. It is hard to evaluate the extent to which such a technique was significant, given that the best performing blending was achieved by non-linear combinations, where such an RMSE-blending cannot work. It is important to note that the contestants used this practice moderately and did not overfit the Test1 set. In fact, the final ranking of the top teams remains the same whether scored by Test1 or by Test2.
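
A sketch of the idea, under the assumptions that the Test1 rows and their number n are known and that ||y||^2 can be estimated (for instance from the reported RMSE of an all-zeros submission, since n * RMSE_0^2 = ||y||^2); all names are illustrative and this is not the contestants' actual code:

    import numpy as np

    # Recover least-squares blend weights from leaderboard RMSEs.
    # n * RMSE_j^2 = ||p_j||^2 - 2 p_j.y + ||y||^2, so each p_j.y can be solved for,
    # and the normal equations (P P^T) w = P y then use only locally computable quantities.
    def blend_weights_from_rmse(preds, rmses, rmse_zero):
        P = np.asarray(preds, dtype=float)            # k x n matrix of individual submissions
        rmses = np.asarray(rmses, dtype=float)        # their reported Test1 RMSEs
        n = P.shape[1]
        y_sq = n * rmse_zero ** 2                     # ||y||^2 from an all-zeros submission
        p_dot_y = 0.5 * ((P ** 2).sum(axis=1) + y_sq - n * rmses ** 2)
        gram = P @ P.T                                # computable locally
        return np.linalg.solve(gram, p_dot_y)         # linear blend weights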
 
We also observed some competitors hiding their true performance on Track2 by making inverted submissions (where predicted positives and negatives are exchanged). The resulting "flipped" submission was scored at exactly 100% minus the error rate of the original predictor, and hence would land close to the bottom of the leaderboard. We are not certain as to why teams employed this practice, but it could have been avoided by always reporting the lower error rate among the actual submission and its inversion.
 
Finally, Lai et al. (2011) nicely found that the ratings in Track2 are arranged in chronological order, which allowed them to use temporal information that was not explicitly available for Track2. This proved very significant in reducing their error rate.

References

1
Suhrid Balakrishnan, Rensheng Wang, Carlos Scheidegger, Angus MacLellan, Yifan Hu, Aaron Archer, Shankar Krishnan, David Applegate, Guang Qin Ma, and S. Tom Au. Combining predictors for recommending music: the false positives' approach to KDD Cup track 2. In KDD-Cup'11 Workshop, 2011.
2
J. Bennett and S. Lanning. The Netflix Prize. In Proc. KDD Cup and Workshop, 2007.
3
Po-Lung Chen, Chen-Tse Tsai, Yao-Nan Chen, Ku-Chun Chou, Chun-Liang Li, Cheng-Hao Tsai, Kuan-Wei Wu, Yu-Cheng Chou, Chung-Yi Li, Wei-Shih Lin, Shu-Hao Yu, Rong-Bing Chiu, Chieh-Yen Lin, Chien-Chih Wang, Po-Wei Wang, Wei-Lun Su, Chen-Hung Wu, Tsung-Ting Kuo, Todd G. McKenzie, Ya-Hsuan Chang, Chun-Sung Ferng, Chia-Mau Ni, Hsuan-Tien Lin, Chih-Jen Lin, and Shou-De Lin. A linear ensemble of individual and blended models for music rating prediction. In KDD-Cup'11 Workshop, 2011a.
4
Tianqi Chen, Zhao Zheng, Qiuxia Lu, Xiao Jiang, Yuqiang Chen, Weinan Zhang, Kailong Chen, Yong Yu, Nathan N. Liu, Bin Cao, Luheng He, and Qiang Yang. Informative ensemble of multi-resolution dynamic factorization models. In KDD-Cup'11 Workshop, 2011b.
5
Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proc. 4th ACM Conference on Recommender Systems (RecSys'10), pages 39-46, 2010.
6
Gideon Dror, Noam Koenigstein, and Yehuda Koren. Yahoo! music recommendations: Modeling music ratings with temporal dynamics and item taxonomy. In Proc. 5th ACM Conference on Recommender Systems (RecSys'11), 2011.
7
Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proc. 8th IEEE Conference on Data Mining (ICDM'08), pages 263-272, 2008.
8
Michael Jahrer and Andreas Toscher. Collaborative filtering ensemble. In KDD-Cup'11 Workshop, 2011.
9
Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proc. 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'08), pages 426-434, 2008.
10
Siwei Lai, Liang Xiang, Rui Diao, Yang Liu, Huxiang Gu, Liheng Xu, Hang Li, Dong Wang, Kang Liu, Jun Zhao, and Chunhong Pan. Hybrid recommendation models for binary user preference prediction problem. In KDD-Cup'11 Workshop, 2011.
11
Todd G. McKenzie, Chun-Sung Ferng, Yao-Nan Chen, Chun-Liang Li, Cheng-Hao Tsai, Kuan-Wei Wu, Ya-Hsuan Chang, Chung-Yi Li, Wei-Shih Lin, Shu-Hao Yu, Chieh-Yen Lin, Po-Wei Wang, Chia-Mau Ni, Wei-Lun Su, Tsung-Ting Kuo, Chen-Tse Tsai, Po-Lung Chen, Rong-Bing Chiu, Ku-Chun Chou, Yu-Cheng Chou, Chien-Chih Wang, Chen-Hung Wu, Hsuan-Tien Lin, Chih-Jen Lin, and Shou-De Lin. Novel models and ensemble techniques to discriminate favorite items from unrated ones for personalized music recommendation. In KDD-Cup'11 Workshop, 2011.
12
Andriy Mnih. Taxonomy-informed latent factor models for implicit feedback. In KDD-Cup'11 Workshop, 2011.
13
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proc. 25th Annual Conference on Uncertainty in Artificial Intelligence (UAI'09), pages 452-461, 2009.
14
Harald Steck. Training and testing of recommender systems on data missing not at random. In Proc. 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'10), pages 713-722, 2010.

KDD Cup 2010: Student performance evaluation

This Year's Challenge

How generally or narrowly do students learn? How quickly or slowly? Will the rate of improvement vary between students? What does it mean for one problem to be similar to another? It might depend on whether the knowledge required for one problem is the same as the knowledge required for another. But is it possible to infer the knowledge requirements of problems directly from student performance data, without human analysis of the tasks?

This year's challenge asks you to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems. This task presents interesting technical challenges, has practical importance, and is scientifically interesting.

Task Description

At the start of the competition, we will provide 5 data sets: 3 development data sets and 2 challenge data sets. Each of the data sets will be divided into a training portion and a test portion, as specified on the Data page. Student performance labels will be withheld for the test portion of each data set. The competition task will be to develop a learning model based on the challenge and/or development data sets, use this algorithm to learn from the training portion of the challenge data sets, and then accurately predict student performance in the test sections. At the end of the competition, the actual winner will be determined based on their model's performance on an unseen portion of the challenge test sets. We will only evaluate each team's last submission of the challenge sets.


Evaluation

You will be allowed to train on the training portion of each data set, and will then be evaluated on your performance at providing Correct First Attempt values for the test portion. We will provide feedback for formatting errors in prediction files, but we will not reveal accuracy on test data until the end of the competition. Note that for each test file you submit, an unidentified portion will be used to validate your data and provide scores for the leaderboard, while the remaining portion will be used for determining the winner of the competition.

For a valid submission, the evaluation program will compare the predictions you provided against the undisclosed true values and report the difference as Root Mean Squared Error (RMSE). If a data set file is missing from a submission, the evaluation program will report the RMSE as 1 for that file. The total score for a submission will then be the average of the RMSE values. All data sets will receive equal weight in the final average, independent of their size.

At the end of the competition, the winner will be the team with the lowest total score.
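
As a concrete illustration of this scoring rule (the dictionary layout and function names are assumptions, not the official evaluation program), the total score can be computed like this:

    import math

    # Sketch of the scoring rule above: per-file RMSE, RMSE = 1 for a missing file,
    # and the total score is the unweighted mean over the challenge files.
    def file_rmse(predicted, actual):
        return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

    def total_score(predictions_by_file, truth_by_file):
        scores = []
        for name, truth in truth_by_file.items():
            preds = predictions_by_file.get(name)
            scores.append(file_rmse(preds, truth) if preds is not None else 1.0)
        return sum(scores) / len(scores)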


Challenges

Technical Challenges

In terms of technical challenges, we mention just a few:

  • The data matrix is sparse: not all students are given every problem, and some problems have only 1 or 2 students who completed each item. So, the contestants need to exploit relationships among problems to bring to bear enough data to hope to learn.
  • There is a strong temporal dimension to the data: students improve over the course of the school year, students must master some skills before moving on to others, and incorrect responses to some items lead to incorrect assumptions in other items. So, contestants must pay attention to temporal relationships as well as conceptual relationships among items.
  • Which problems a given student sees is determined in part by student choices or past success history: e.g., students only see remedial problems if they are having trouble with the non-remedial problems. So, contestants need to pay attention to causal relationships in order to avoid selection bias.

Scientific and Practical Importance

From a practical perspective, improved models could be saving millions of hours of students' time (and effort) in learning algebra. These models should both increase achievement levels and reduce time needed. Focusing on just the latter, for the .5 million students that spend about 50 hours per year with Cognitive Tutors for mathematics, let's say these optimizations can reduce time to mastery by at least 10%. One experiment showed the time reduction was about 15% (Cen et al. 2007). That's 5 hours per student, or 2.5 million student hours per year saved. And this .5 million is less than 5% of all algebra-studying students in the US. If we include all algebra students (20x) and the grades 6-11 for which there are Carnegie Learning and Assistment applications (5x), that brings our rough estimate to 250 million student hours per year saved! In that time, students can be moving on in math and science or doing other things they enjoy.

From a scientific viewpoint, the ability to achieve low prediction error on unseen data is evidence that the learner has accurately discovered the underlying factors which make items easier or harder for students. Knowing these factors is essential for the design of high-quality curricula and lesson plans (both for human instructors and for automated tutoring software). So you, the contestants, have the potential to influence lesson design, improving retention, increasing student engagement, reducing wasted time, and increasing transfer to future lessons.

Currently K-12 education is extremely focused on assessment. The No Child Left Behind act has put incredible pressure on schools to "teach to the test", meaning that a significant amount of time is spent preparing and taking standardized tests. Much of the time spent drilling for and taking these tests is wasted from the point of view of deep learning (long-term retention, transfer, and desire for future learning); so any advances which allow us to reduce the role of standardized tests hold the promise of increasing deep learning.

To this end, a model which accurately predicts long-term future performance as a byproduct of day-to-day tutoring could augment or replace some of the current standardized tests: this idea is called "assistment", from the goal of assessing performance while simultaneously assisting learning. Previous work has suggested that assistment is indeed possible: e.g., an appropriate analysis of 8th-grade tutoring logs can predict 10th-grade standardized test performance as well as 8th-grade standardized test results can predict 10th-grade standardized test performance (Feng, Heffernan, & Koedinger, 2009). But it is far from clear what the best prediction methods are; so, the contestants' algorithms may provide insights that allow important improvements in assistment.

Fundamental Questions

If a student is correct at one problem (e.g., "Starting with a number, if I multiply it by 6 and then add 66, I get 81.90. What's the number?") at one time, how likely are they to be correct at another problem (e.g., "Solve for x: 6x+66=81.90") at a later time?

These questions are of both scientific interest and practical importance. Scientifically, relevant deep questions include what is the nature of human knowledge representations and how generally do humans transfer their learning from one situation to another. Human learners do not always represent and solve mathematical tasks as we might expect. You might be surprised if you thought that a student working on the second problem above, the equation 6x+66=81.90, is likely to be correct given that he was correct on the first problem, the story problem. It turns out that most students are able to solve simple story problems like this one more successfully than the matched equation (Koedinger & Nathan, 2004; Koedinger, Alibali, & Nathan, 2008). In other words, there are interesting surprises to be found in student performance data.

Cognitive Tutors for mathematics are now in use in more than 2,500 schools across the US for some 500,000 students per year. While these systems have been quite successful, surprises like the one above suggest that the models behind these systems can be much improved. More generally, a number of studies have demonstrated how detailed cognitive task analysis can result in dramatically better instruction (Clark, Feldon, van Merrienboer, Yates, & Early, 2007; Lee, 2003). However, such analysis is painstaking and requires a high level of psychological expertise. We believe it possible that machine learning on large data sets can reap many of the benefits of cognitive task analysis, but without the great effort and expertise currently required.

References

  • Cen, H., Koedinger, K. R., & Junker, B. (2006). Learning Factors Analysis: A general method for cognitive model evaluation and improvement. In M. Ikeda, K. D. Ashley, T.- W. Chan (Eds.) Proceedings of the 8th International Conference on Intelligent Tutoring Systems, 164-175. Berlin: Springer-Verlag.
  • Clark, R. E., Feldon, D., van Merrienboer, J., Yates, K., & Early, S. (2007). Cognitive task analysis. In J. M. Spector, M. D. Merrill, J. J. G. van Merrienboer, & M. P. Driscoll (Eds.), Handbook of research on educational communications and technology (3rd ed., pp. 577-593). Mahwah, NJ: Lawrence Erlbaum Associates.
  • Feng, M., Heffernan, N.T., & Koedinger, K.R. (2009). Addressing the assessment challenge in an online system that tutors as it assesses. User Modeling and User-Adapted Interaction: The Journal of Personalization Research (UMUAI). 19(3), pp. 243-266.
  • Koedinger, K. R. & Aleven, V. (2007). Exploring the assistance dilemma in experiments with Cognitive Tutors. Educational Psychology Review, 19(3), 239-264.
  • Lee, R. L. (2003). Cognitive task analysis: A meta-analysis of comparative studies. Unpublished doctoral dissertation, University of Southern California, Los Angeles, California.
  • Pavlik, P. I., Cen, H., Wu, L.,& Koedinger, K. R. (2008). Using item-type performance covariance to improve the skill model of an existing tutor. In Proceedings of the First International Conference on Educational Data Mining. 77-86.

Competition Rules

Conditions of participation: Anybody who complies with the rules of the challenge (KDD Cup 2010) is welcome to participate. Only the organizers are excluded from participating. The KDD Cup 2010 is part of the competition program of the Knowledge Discovery in Databases conference (KDD 2010), July 25-28 in Washington, DC. Participants are not required to attend the KDD Cup 2010 workshop, which will be held at the conference, and the workshop is open to anyone who registers. The proceedings of the competition will be published in a volume of the JMLR: Workshop and Conference Proceedings series.

Anonymity: All entrants must identify themselves by registering on the KDD Cup 2010 website. However, they may elect to remain anonymous by choosing a nickname and checking the box "Make my profile anonymous". If this box is checked, only the nickname will appear in the Leaderboard instead of the real name. Participant emails will not appear anywhere on the website and will be used only by the organizers to communicate with the participants. To be eligible for prizes, the participants will have to publicly reveal their identity and uncheck the box "Make my profile anonymous".

Teams: To register a team, only register the team leader and choose a nickname for your team. We'll let you know later how to disclose the members of your team. We limit each team to one final entry. As an individual, you cannot enter under multiple names - this would be considered cheating and would disqualify you - nor can you participate in multiple teams. Multiple teams from the same organization, however, are allowed so long as each team leader is a different person and the teams do not intersect. During the development period, each team must have a different registered team leader. To be ranked in the challenge and qualify for prizes, each registered participant (individual or team leader) will have to disclose the names of their team members before the final results of the challenge are released. Hence, at the end of the challenge, you will have to choose which team you want to belong to (only one!) before the results are publicly released. If a person participates in multiple teams, those teams will be disqualified. After the results are released, no change in team composition will be allowed. Before the end of the challenge, the team leaders will have to declare the composition of their team, and this must correspond to the list of co-authors in the proceedings if they decide to publish their results. Hence a professor cannot have his/her name on all of his/her students' papers (but can be thanked in the acknowledgments).

A team can be either a student team (eligible for student-team prizes) or not a student team (eligible for travel awards). In a student team, a professor should be cited appropriately, but in the spirit of the competition, student teams should consist primarily of student work. We will ask for participants to state whether they are a student team prior to the end of the competition.

Data: Data are available for download from the Data page to registered participants. Each data set is available as a separate archive to facilitate downloading. For viewing accuracy on the Leaderboard, participants may enter results on either or both development and challenge data sets, but results on the development data sets will not count toward the final evaluation.

Challenge duration: The challenge is about 2 months in duration (April 1 - June 8, 2010). To be eligible for prizes, final submissions must be received by June 8, 11:59pm EDT (GMT-4).

On-line feedback: On-line feedback is available through the upload results page and Leaderboard.

Submission method: The method of submission is via the form on the Upload page. To be ranked, submissions must include results on the test portion only of the challenge or development data sets. Results on the development data sets will not count as part of the competition. Multiple submissions are allowed.

Evaluation and ranking: For each team, only the last valid entry made by the team leader will count towards determining the winner. Valid entries must include results on both challenge data sets. The method of scoring is described on this page.

Reproducibility: Participation is not conditioned on delivering code or publishing methods. However, we will ask the top-ranking participants to voluntarily fill out a fact sheet about their methods, contribute papers to the proceedings, and help in reproducing their results.

Prizes: Thanks to our sponsors, Facebook, Elsevier, and IBM Research, we will be offering the following prizes to student teams:

Prize amounts increased on April 23, 2010:
First place: $5500
Second place: $3000
Third place: $1500

The Pittsburgh Science of Learning Center (PSLC) will provide the following travel awards to cover expenses related to attending the KDD Cup 2010 workshop at the KDD conference:

Overall first place: $1700
Overall second place: $1150
Overall third place: $650

Student first place: $1700
Student second place: $1150
Student third place: $650

Data Download

Development Data Sets

Data sets Students Steps File
Algebra I 2005-2006 575 813,661 algebra_2005_2006.zip
Algebra I 2006-2007 1,840 2,289,726 algebra_2006_2007.zip
Bridge to Algebra 2006-2007 1,146 3,656,871 bridge_to_algebra_2006_2007.zip

Development data sets are provided for familiarizing yourself with the format and developing your learning model. Using them is optional, and your predictions on these data sets will not count toward determining the winner of the competition. Development data sets differ from challenge sets in that the actual student performance values for the prediction column, "Correct First Attempt", are provided for all steps - see the file ending in "_master.txt".

Challenge Data Sets

Data sets Students Steps File
Algebra I 2008-2009 3,310 9,426,966 algebra_2008_2009.zip
Bridge to Algebra 2008-2009 6,043 20,768,884 bridge_to_algebra_2008_2009.zip

Predictions on challenge data sets will count toward determining the winner of the competition. In each of these two data sets, you'll be asked to provide predictions in the column "Correct First Attempt" for a subset of the steps.


Data Format

Our available data takes the form of records of interactions between students and computer-aided tutoring systems. The students solve problems in the tutor, and each interaction between the student and the computer is logged as a transaction. Four key terms form the building blocks of our data: problem, step, knowledge component, and opportunity. To define these terms more concretely, we'll use the following scenario:

Using a computer tutor for geometry, a student completes a problem where she is asked to find the area of a piece of scrap metal left over after removing a circular area (the end of a can) from a metal square (Figure 1). The student enters everything in the worksheet except for the row labels and the column and 'Unit' labels for the first three columns.

Figure 1. A problem from Carnegie Learning's Cognitive Tutor Geometry (2005 version).

Problem

A problem is a task for a student to perform that typically involves multiple steps. In the example above, the problem asks the student to find the area of a piece of scrap metal left over after removing a circular area (the end of a can) from a metal square (Figure 1). The row labeled 'Question 1' in the worksheet corresponds to a single problem.

In language domains, such tasks are more often called activities or exercises rather than problems. A language activity, for example, could involve finding and correcting all of the grammatical errors in a paragraph.

Step

A step is an observable part of the solution to a problem. Because steps are observable, they are partly determined by the user interface available to the student for solving the problem. (It is not necessarily the case that the interface completely determines the steps: for example, the student might be expected to create new rows or columns of a table before filling in their entries.)

In the example problem above, the steps for the first question are:

  • find the radius of the end of the can (a circle)
  • find the length of the square ABCD
  • find the area of the end of the can
  • find the area of the square ABCD
  • find the area of the left-over scrap

This whole collection of steps comprises the solution. The last step can be considered the "answer", and the others are "intermediate" steps.

Students might not (and often do not) complete a problem by performing only the correct steps - the student might request a hint from the tutor or enter an incorrect value. We refer to the actions of a student who is working toward performing a step correctly as transactions. A transaction is an interaction between the student and the tutoring system. Each hint request, incorrect attempt, or correct attempt is a transaction, and each recorded transaction is referred to as an attempt for a step.

In Table 1, transactions have been consolidated and displayed by student and step, producing a step record table. This is the format of the data provided to you in this competition. A step record is a summary of all of a given student's attempts for a given step.

Table 1. Data from the "Making Cans" example, aggregated by student-step

Row  Student  Problem           Step                       Incorrects  Hints  Error Rate  Knowledge Component  Opportunity Count
1    S01      WATERING_VEGGIES  (WATERED-AREA Q1)          0           0      0           Circle-Area          1
2    S01      WATERING_VEGGIES  (TOTAL-GARDEN Q1)          2           1      1           Rectangle-Area       1
3    S01      WATERING_VEGGIES  (UNWATERED-AREA Q1)        0           0      0           Compose-Areas        1
4    S01      WATERING_VEGGIES  DONE                       0           0      0           Determine-Done       1
5    S01      MAKING-CANS       (POG-RADIUS Q1)            0           0      0           Enter-Given          1
6    S01      MAKING-CANS       (SQUARE-BASE Q1)           0           0      0           Enter-Given          2
7    S01      MAKING-CANS       (SQUARE-AREA Q1)           0           0      0           Square-Area          1
8    S01      MAKING-CANS       (POG-AREA Q1)              0           0      0           Circle-Area          2
9    S01      MAKING-CANS       (SCRAP-METAL-AREA Q1)      2           0      1           Compose-Areas        2
10   S01      MAKING-CANS       (POG-RADIUS Q2)            0           0      0           Enter-Given          3
11   S01      MAKING-CANS       (SQUARE-BASE Q2)           0           0      0           Enter-Given          4
12   S01      MAKING-CANS       (SQUARE-AREA Q2)           0           0      0           Square-Area          2
13   S01      MAKING-CANS       (POG-AREA Q2)              0           0      0           Circle-Area          3
14   S01      MAKING-CANS       (SCRAP-METAL-AREA Q2)      0           0      0           Compose-Areas        3
15   S01      MAKING-CANS       (POG-RADIUS Q3)            0           0      0           Enter-Given          5
16   S01      MAKING-CANS       (SQUARE-BASE Q3)           0           0      0           Enter-Given          6
17   S01      MAKING-CANS       (SQUARE-AREA Q3)           0           0      0           Square-Area          3
18   S01      MAKING-CANS       (POG-AREA Q3)              0           0      0           Circle-Area          4
19   S01      MAKING-CANS       (SCRAP-METAL-AREA Q3)      0           0      0           Compose-Areas        4
20   S01      MAKING-CANS       DONE                       0           0      0           Determine-Done       2

End of Table

Knowledge Component

A knowledge component is a piece of information that can be used to accomplish tasks, perhaps along with other knowledge components. Knowledge component is a generalization of everyday terms like concept, principle, fact, or skill, and cognitive science terms like schema, production rule, misconception, or facet.

Each step in a problem requires the student to know something, a relevant concept or skill, to perform that step correctly. In given data sets, each step can be labeled with one or more hypothesized knowledge components needed - see the Knowledge Component column of Table 1 for example KC labels. In row 8 of Table 1, the researcher has hypothesized that the student needs to know CIRCLE-AREA to answer (POG-AREA Q1). In row 9, the COMPOSE-AREAS knowledge component is hypothesized to be needed to answer (SCRAP-METAL-AREA Q1).

Every knowledge component is associated with one or more steps. One or more knowledge components can be associated with a step. This association is typically originally defined by the problem author, but researchers can provide alternative knowledge components and associations with steps; together these are known as a Knowledge Component Model.

Opportunity

An opportunity is a chance for a student to demonstrate whether he or she has learned a given knowledge component. A student's opportunity count for a given knowledge component increases by 1 each time the student encounters a step that requires this knowledge component. See the Opportunity Count column of Table 1 for examples.

An opportunity is both a test of whether a student knows a knowledge component and a chance for the student to learn it. While students may make multiple attempts at a step or request hints from a tutor (these are transactions), the whole set of attempts are considered a single opportunity. As a student works through steps in problems, he/she will have multiple opportunities to apply or learn a knowledge component.
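
To make the bookkeeping concrete, here is a minimal sketch (our own illustration, not competition code) of how opportunity counts could be derived from a time-ordered stream of step records; the field names are hypothetical.

  from collections import defaultdict

  def add_opportunity_counts(step_records):
      """Annotate time-ordered step records with per-student, per-KC opportunity counts.

      Each record is assumed to be a dict with a 'student' id and a 'kcs' list of
      knowledge component names (hypothetical field names, for illustration only).
      """
      seen = defaultdict(int)  # (student, kc) -> opportunities encountered so far
      for record in step_records:
          opportunities = []
          for kc in record["kcs"]:
              seen[(record["student"], kc)] += 1
              opportunities.append(seen[(record["student"], kc)])
          record["opportunities"] = opportunities
      return step_records

  # The same student meets Circle-Area on two different steps:
  steps = [
      {"student": "S01", "kcs": ["Circle-Area"]},
      {"student": "S01", "kcs": ["Circle-Area", "Compose-Areas"]},
  ]
  add_opportunity_counts(steps)
  # steps[1]["opportunities"] is now [2, 1]: the second opportunity for Circle-Area,
  # the first for Compose-Areas.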

For all competition data sets, each record will be a step that contains the following attributes:

  • Row: the row number. Update (04-20-2010): for challenge data sets, the row number in each file (train, test, and submission) is no longer taken from the original data set file. Instead, rows are renumbered within each file. So instead of 1...n rows for the training file and n+1...m rows for the test/submission file, it is now 1...n for the training file and 1...n for the test/submission file.
  • Anon Student Id: unique, anonymous identifier for a student
  • Problem Hierarchy: the hierarchy of curriculum levels containing the problem.
  • Problem Name: unique identifier for a problem
  • Problem View: the total number of times the student encountered the problem so far.
  • Step Name: each problem consists of one or more steps (e.g., "find the area of rectangle ABCD" or "divide both sides of the equation by x"). The step name is unique within each problem, but there may be collisions between different problems, so the only unique identifier for a step is the pair of problem_name and step_name.
  • Step Start Time: the starting time of the step. Can be null.
  • First Transaction Time: the time of the first transaction toward the step.
  • Correct Transaction Time: the time of the correct attempt toward the step, if there was one.
  • Step End Time: the time of the last transaction toward the step.
  • Step Duration (sec): the elapsed time of the step in seconds, calculated by adding all of the durations for transactions that were attributed to the step. Can be null (if step start time is null).
  • Correct Step Duration (sec): the step duration if the first attempt for the step was correct.
  • Error Step Duration (sec): the step duration if the first attempt for the step was an error (incorrect attempt or hint request).
  • Correct First Attempt: the tutor's evaluation of the student's first attempt on the step - 1 if correct, 0 if an error.
  • Incorrects: total number of incorrect attempts by the student on the step.
  • Hints: total number of hints requested by the student for the step.
  • Corrects: total correct attempts by the student for the step. (Only increases if the step is encountered more than once.)
  • KC(KC Model Name): the identified skills that are used in a problem, where available. A step can have multiple KCs assigned to it. Multiple KCs for a step are separated by ~~ (two tildes). Since opportunity describes practice by knowledge component, the corresponding opportunities are similarly separated by ~~ (a parsing sketch follows this list).
  • Opportunity(KC Model Name): a count that increases by one each time the student encounters a step with the listed knowledge component. Steps with multiple KCs will have multiple opportunity numbers separated by ~~.
  • Additional KC models, which exist for the challenge data sets, will appear as additional pairs of columns (KC and Opportunity columns for each model).
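
As a minimal sketch of how the "~~"-separated KC and Opportunity columns might be parsed (assuming tab-separated files and the column names above; the KC model name and helper function are illustrative, not part of any official tooling):

  import csv

  def parse_kc_fields(row, kc_column, opp_column):
      """Return (knowledge component, opportunity count) pairs for one step record.

      Both columns use '~~' to separate multiple entries; an empty KC cell means
      no knowledge component is assigned to the step.
      """
      kcs = (row.get(kc_column) or "").strip()
      opps = (row.get(opp_column) or "").strip()
      if not kcs:
          return []
      kc_list = kcs.split("~~")
      opp_list = [int(o) for o in opps.split("~~")] if opps else [None] * len(kc_list)
      return list(zip(kc_list, opp_list))

  # Illustrative usage on a training file (the KC model name is an example):
  with open("algebra_2008_2009_train.txt", newline="") as f:
      for row in csv.DictReader(f, delimiter="\t"):
          pairs = parse_kc_fields(row, "KC(SubSkills)", "Opportunity(SubSkills)")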

For the test portion of the challenge data sets, values will not be provided for the following columns:

  • Step Start Time
  • First Transaction Time
  • Correct Transaction Time
  • Step End Time
  • Step Duration (sec)
  • Correct Step Duration (sec)
  • Error Step Duration (sec)
  • Correct First Attempt
  • Incorrects
  • Hints
  • Corrects

The competition will use 5 data sets (3 development data sets and 2 challenge data sets) from 2 different tutoring systems. These data sets come from multiple schools over multiple school years. The systems include the Carnegie Learning Algebra system, deployed in 2005-2006 and 2006-2007, and the Bridge to Algebra system, deployed in 2006-2007. The development data sets have previously been used in research and are available through the Pittsburgh Science of Learning Center DataShop (as well as this website). The challenge data sets will come from the same 2 tutoring systems for subsequent school years; the challenge data sets have not been made available to researchers prior to the KDD Cup.

Each data set will be broken into two files, a training file and a test file. A third file, a submission file, will be provided for submitting results. The submission file will contain a subset of the columns in the test file.

Each data set will be split as follows:

In the diagram (not reproduced here), each horizontal line represents a student-step (a record of a student working on a step). The data set is broken down by student, unit (a classification of a portion of the math curriculum hierarchy, e.g., "Linear Inequality Graphing"), section (a portion of the curriculum that falls within a unit, e.g., "Section 1 of 3"), and problem.

Test rows are determined by a program that randomly selects one problem for each student within a unit, and places all student-step rows for that student and problem in the test file. Based on time, all preceding student-step rows for the unit will be placed in a training file, while all following student-step rows for that unit will be discarded. The goal at testing time will be to predict whether the student got the step right on the first attempt for each step in that problem. Each prediction will take the form of a value between 0 and 1 for the column Correct First Attempt.

For each test file you submit, an unidentified portion will be used to validate your data and provide scores for the leaderboard, while the remaining portion will be used for determining the winner of the competition.
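
To illustrate the shape of the task (and nothing more), the following sketch fits the simplest possible baseline - each student's mean Correct First Attempt rate in the training file, with the global mean as a fallback - and produces a value in [0, 1] for every test row. The file names and the assumption that the files are tab-separated are ours; this is not the method of any winning team.

  import pandas as pd

  train = pd.read_csv("algebra_2008_2009_train.txt", sep="\t")
  test = pd.read_csv("algebra_2008_2009_test.txt", sep="\t")

  # Per-student first-attempt correctness rate, with a global fallback for
  # students who do not appear in the training file.
  global_mean = train["Correct First Attempt"].mean()
  student_mean = train.groupby("Anon Student Id")["Correct First Attempt"].mean()

  test["Correct First Attempt"] = (
      test["Anon Student Id"].map(student_mean).fillna(global_mean).clip(0, 1)
  )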

How to format and ship results

To submit your results, you must return the results in a separate text file for each data set, using the same filename as the downloaded submission file. These text files must be grouped into one archive. You can upload either results for the development data sets or the challenge data sets, but not both at the same time. You may submit results on a subset of data sets to get online feedback.

Each submission file will contain two columns:

  • Row: the row number, as carried over from the original data set file. Update (04-20-2010): for challenge data sets, the row number in each file (train, test, and submission) is no longer taken from the original data set file. Instead, rows are renumbered within each file. So instead of 1...n rows for the training file and n+1..m rows for the test/submission file, it is now 1...n for the training file and 1...n for the test/submission file.
  • Correct First Attempt: your prediction value, a decimal number between 0 and 1, that indicates the probability of a correct first attempt for this student-step.

The upload page will reject submissions that are missing test rows, provide test rows out of order, contain duplicate test rows, or contain row values that are not in the original test file. It will also reject submissions that provide values in the Correct First Attempt column that are not a decimal between 0 and 1. Predictions must be included for all rows. If errors are detected in the file, the upload page will provide feedback as to where in the file(s) the errors are.
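
A hedged sketch of writing a submission file and approximating the upload page's checks locally (the column and file layout follows the description above; the helper function is ours):

  import pandas as pd

  def write_submission(predictions: pd.DataFrame, template_path: str, out_path: str) -> None:
      """Write the two-column submission file and run rough local versions of the
      upload page's checks. `predictions` must hold 'Row' and 'Correct First Attempt';
      `template_path` is the downloaded submission file, used only for the expected
      row order."""
      template = pd.read_csv(template_path, sep="\t")
      out = predictions[["Row", "Correct First Attempt"]]

      assert out["Correct First Attempt"].notna().all(), "a prediction is required for every row"
      assert out["Correct First Attempt"].between(0, 1).all(), "predictions must lie in [0, 1]"
      assert out["Row"].tolist() == template["Row"].tolist(), "rows missing, duplicated, or out of order"

      out.to_csv(out_path, sep="\t", index=False)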

To read about the evaluation process, see the Tasks page.

Results

Winners of KDD Cup 2010: All Teams
  • First Place: National Taiwan University
    Feature engineering and classifier ensembling for KDD CUP 2010
  • First Runner Up: Zhang and Su
    Gradient Boosting Machines with Singular Value Decomposition
  • Second Runner Up: BigChaos @ KDD
    Collaborative Filtering Applied to Educational Data Mining

Winners of KDD Cup 2010: Student Teams
  • First Place: National Taiwan University
    Feature engineering and classifier ensembling for KDD CUP 2010
  • First Runner Up: Zach A. Pardos
    Using HMMs and bagged decision trees to leverage rich features of user and skill
  • Second Runner Up: SCUT Data Mining
    Split-Score-Predicate

Full Results: All Teams
 
Rank Team Name Cup Score Leaderboard Score
1 National Taiwan University 0.272952 0.276803
2 Zhang and Su 0.273692 0.27679
3 BigChaos @ KDD 0.274556 0.279046
4 Zach A. Pardos 0.27659 0.279695
5 Old Dogs With New Tricks 0.277864 0.281163
6 SCUT Data Mining 0.280476 0.284624
7 pinta 0.28455 0.2892
8 DMLab 0.285977 0.291296
9 FEG 0.288764 0.293141
10 FEG-K 0.289877 0.294338
11 Vadis 0.292382 0.297136
12 psweather 0.292794 0.297542
13 uq 0.295389 0.301525
14 EXL 0.29619 0.302679
15 Y10 0.298006 0.304277
16 Shiraz 0.299574 0.31043
17 UniQ2 0.300009 0.307777
18 Andreas von Hessling 0.300228 0.310936
19 Baby 0.301864 0.311978
20 grandprix 0.309506 0.316712
21 Atlantis 0.309573 0.316246
22 Green Ensemble 0.311678 0.324581
23 Troae 0.314485 0.326375
24 pozip@FRI 0.3157 0.32501
25 Heymar 0.323107 0.335412
26 MiloBing 0.344566 0.361959
27 DataKiller 0.344611 0.360974
28 ecnusei07 0.34479 0.362294
29 Kun Liu 0.455219 0.460331
End of Table
Full Results: Student Teams
Rank Team Name Cup Score Leaderboard Score
1 National Taiwan University 0.272952 0.276803
2 Zach A. Pardos 0.27659 0.279695
3 SCUT Data Mining 0.280476 0.284624
4 Y10 0.298006 0.304277
5 Shiraz 0.299574 0.31043
6 UniQ2 0.300009 0.307777
7 Baby 0.301864 0.311978
8 Atlantis 0.309573 0.316246
9 Green Ensemble 0.311678 0.324581
10 Troae 0.314485 0.326375
11 pozip@FRI 0.3157 0.32501
12 Heymar 0.323107 0.335412
13 MiloBing 0.344566 0.361959
14 ecnusei07 0.34479 0.362294
End of Table

Post-Competition Questions

What do the 1's and 0's mean on the leaderboard for development data sets?

This is a bug that was an effect of adding "cup scoring" and "leaderboard scoring". You can ignore the "cup scoring" for development data sets; "leaderboard scoring" is the only meaningful score for development data sets, and it's a score based on the entire submission.

Will you be making the Challenge Test Set labels available now that the contest has ended?

We have just re-opened the submissions and leaderboard to allow people to continue working on their algorithms. Sometime later, we hope to release the full master data sets, but have no definite plans at this time.

Why are there now two alternate views of the Leaderboard?

During the KDD Cup Workshop, some participants suggested that we change the way the leaderboard works so that we display the same type of scores that were used to determine the competition winners (by validating most of the predictions instead of a small portion). Recall that during the competition, the evaluation process that powered the Leaderboard only looked at a small portion of participant prediction files; a much larger portion was used to determine the winners. We thought this suggestion was a great idea, but we didn't want to change the Leaderboard and individual submission pages in such a way that they no longer reflected the scores and ranking given to participants at the end of the competition. As a solution, we've added a toggle to the Leaderboard and individual submission pages allowing you to view either the Cup Scoring, where a majority of the prediction file is used to score the entry, or Leaderboard Scoring, where a small portion of the prediction file is used to score the entry. We calculate both, so you can toggle between them. Cup Scoring is more accurate, but Leaderboard scoring is what was used during the competition to power the Leaderboard.


Registration and General Competition

The leaderboard just went blank, and it says there are no submissions. Is it safe to submit?

Yes, it is still safe to submit, and the leaderboard has not been lost. There is a bug in our system that is causing the leaderboard to appear empty under high load. The submission and the leaderboard data are still stored properly.

I didn't get a verification email. What should I do?

Contact us and tell us that you registered but received no email. We'll verify your account. Please do not create a new account.

Where can I find more info about the competition?

Professors at WPI want to help undergrads compete. Professors Neil Heffernan, Ryan Baker, Joe Beck, and Carolina Ruiz of Worcester Polytechnic Institute will be giving online webinar lectures at WPI and posting them online for undergrads across the country. They will give suggestions on how to do well on the KDD Cup this year. They're also going to give away some tools. See their website for more info: http://teacherwiki.assistment.org/wiki/index.php/KDD

When can I register and download data?

You can register April 1, 2010 at 2pm EDT, and download data as soon as you've registered.

Can team members come from different organizations?

Team members can definitely come from different organizations. The only restriction on teams is that any person can only participate in a single team. See the Rules for more information about participation.

What's the difference between a student team and a non-student team?

A team can be either a student team (eligible for student-team prizes) or not a student team (eligible for travel awards). In a student team, a professor should be cited appropriately, but in the spirit of the competition, student teams should consist primarily of student work. We will ask for participants to state whether they are a student team prior to the end of the competition. Our sponsors have provided cash prizes for student teams.

Does this mean there won't be any monetary prizes for industry participants other than travel awards?

Rather than splitting the prizes among student and non-student teams, we decided that we will only award monetary prizes to students this year, since the cash prize would make the biggest difference for them. This way we can offer more consistent cash prizes, while still offering a sizable travel award for all top placed participants. For an industry participant, we figure that being a top performer in the KDD Cup is worth way more in PR than the cash prize anyway. Plus, our experience has been that, in some cases, providing cash prizes to industry participants creates more problems on their side than it's worth. We hope the lack of prizes for industry won't keep you from participating in the KDD Cup this year!


Data Format

What are the differences between development and challenge data sets?

Development data sets are provided for familiarizing yourself with the format and developing your learning model. Using them is optional, and your predictions on these data sets will not count toward determining the winner of the competition. Development data sets differ from challenge sets in that the actual student performance values for the prediction column, "Correct First Attempt", are provided for all steps - see the file ending in "_master.txt".

Challenge data sets will be made available on April 8 at 2pm EDT. Predictions on these data sets will count toward determining the winner of the competition. In each of these two data sets, you'll be asked to provide predictions in the column "Correct First Attempt" for a subset of the steps. For more information on which steps these will be, see the bottom of our Data page.

What are the different files in each data set ZIP file I downloaded?

[dataset]_train.txt
This file contains all training rows for the data set, but no test rows. Use this file to train your learning model.
[dataset]_test.txt
This file contains only test rows for the data set. You will make predictions for all of these rows in the column Correct First Attempt, but submit only the columns Row and Correct First Attempt - see the submission file format, [dataset].txt, below.
[dataset]_submission.txt
A submission file composed of only 2 columns, Row and Correct First Attempt. This shows the format of a valid submission file in which you'd provide prediction values (probabilities) in the second column. Note that only test rows are included, not training rows. (For development sets, this file is just [dataset].txt)
[dataset]_master.txt
Only included for development data sets, this file shows all of the actual student performance values for test rows in the prediction column, Correct First Attempt. It also includes values for the various time columns and Incorrects/Hints/Corrects columns, which are empty in the test file. These are documented on the Data page. It's up to you how you'd like to use the development data sets and the included actual student performance values.

Your description of the data format states that there should be one test problem (composed of multiple step rows) per student and unit, but I see more than one problem per student-unit in a development data set. I also see only a single test row per student-unit, not the multiple rows you describe. Why?

There are some errors in the development data sets. We might update the development data sets to fix this issue, but we will make sure this doesn't happen in the challenge data sets: expect only one problem per student-unit in any challenge data set, composed of one or more step rows. The purpose of releasing development data sets was to find any issues in the data and to allow participants to familiarize themselves with the data structure.

The file algebra_2006_2007_train.txt appears to end on a partial record that has only 11 fields and no line terminator.

This is a bug in that development data set. We plan to update this development data set. - April 5, 2010 2:15pm

Update: An updated version of the "Algebra 2006-2007" data set with a fixed final row is available on the Data page. - April 5, 2010 4:45pm

Do the steps for a given problem have to be completed in a unique order?

No. For a given problem, the order of the steps completed could vary across students or it could be the same. The set of steps completed could also vary or be the same—in some problems, some steps are optional. Also note that for some steps, the correct answer the student enters could vary across students (e.g., the tutor might have accepted "35" and "35.0", or something more dissimilar).

I see that there are some problems which have different number of steps when attempted by different students. For example, in "Unit ES_05", problem "EG61" has been attempted by multiple students but each time the number of steps differs. Does the number of steps in a given instantiation of a problem depend on the performance of the student on the earlier steps of the same problem or even previously attempted problems? Or is it "pre-decided" by the tutor program (for a given instantiation) and a student has to just attempt the steps sent forth by it?

The number of steps you see for a problem could vary across students based on performance within a problem (e.g., the tutor might present extra steps based on errors made) or just based on the approach the student took (e.g., the student might skip optional steps). I don't know of an example where performance on a prior problem affects the number of steps in a later problem, in the sense that the tutor would control access to steps. When looking at solver or grapher data (the unit referenced is an equation solver unit), the number of steps almost always varies because these are somewhat exploratory environments where non-useful steps (steps that aren't closer to the solution) are allowed. So with respect to steps, some tutored problems are more pre-decided than others.

One of the fields is Corrects which is described as "total correct attempts by the student for the step. (Only increases if the step is encountered more than once.)". Why would the same step come up more than once in a problem?

This can happen in a few cases. One case is where the tutoring software allows the student to enter something and receive feedback, but does not lock the widget upon receiving the correct response. An example is a problem that includes an interactive graph where the student can set the scale of the graph: they can set the scale of the graph as many times as they'd like so long as they provide valid numbers. Each valid attempt is a "correct" one. Another case is in equation solving problems. In these problems, the student could transform the equation any number of times. The "step" is represented as the current equation, so if a transformation leads to an equation they've seen already, the number of "corrects" would increase by 1. Other cases probably exist in the data.

Is each step uniquely identified by problem hierarchy, problem name, and step name? Or is just problem name and step name needed?

A step row is uniquely identified by this hierarchy:
Student > Problem Hierarchy > Problem Name > Problem View > Step

Update: A bug in the data prevents this from always being the case. - May 7, 2010 12:00pm

How should I interpret these long knowledge components I often see? For example:

[SkillRule: Eliminate Parens; 
{CLT nested; CLT nested, parens;
Distribute Mult right;
Distribute Mult left;
(+/-x ±a)/b=c, mult;
(+/-x ±a)*b=c, div;
[var expr]/[const expr] = [const expr], multiply;
Distribute Division left;
Distribute Division right;
Distribute both mult left;
Distribute both mult right;
Distribute both divide left;
Distribute both divide right;
Distribute subex}]

The skill here is "Eliminate Parens". The items listed in the curly braces {CLT nested; CLT nested, parens; Distribute Mult right; ... etc.} are just a static list of every operation that could possibly trigger this skill. The order is not at all meaningful. There is no hierarchical or sequential structure here; it is simply how it was listed by the tutor developer. In fact, any element in this list may not be relevant to the particular problem the student was operating on. This is simply a list of all possible operations that might be involved in triggering the skill. In this case, there are a lot of operations that could trigger the "Eliminate Parens" skill.

Can you tell me more about the KC models in the challenge data sets?

The extra columns mean that for the challenge data sets, there are additional KC models. A KC model is a list of mappings between each step and one or more knowledge components; it is also known as a Transfer Model or a Skill Model. Each KC model is represented by two columns (KC and Opportunity). Within a model, there can be multiple KCs associated with a step. When there are, they are separated by "~~". The corresponding opportunity numbers are also separated by "~~", and are given in the same order, so each KC has an opportunity count for that step.

A simple answer about these individual models is that Rules is a more fine-grained categorization of similar steps and KTracedSkills is a more coarse-grained categorization. ("Rules" corresponds with the production rules in the tutor, which get grouped into "meta-productions" that correspond 1:1 with the KTracedSkills. I believe there is a strict one-to-many mapping between each of the KTracedSkills and instances of the Rules, but there may be exceptions.) The KTracedSkills level is used by the tutor to select future problems and is presumed to be the level at which students are learning and transferring their knowledge from one task experience (step) to another related one. That presumption is not always borne out by the data, and the fine-grained Rules KC model may provide clues to a better clustering of steps for predicting transfer of learning.

To summarize in a slightly different way:

KTracedSkills - these are the skills that are knowledge-traced (i.e. the ones that appear on the tutor's skillometer).

SubSkills - these are the skills identified by the system, whether or not they are being traced.

Rules - these are the actual rule names used to determine skills. The distinction between these and "SubSkills" is that this KC model would include the actual model rules, not the meta-productions, while the SubSkills would be based on meta-productions.

What does it mean for the current step's end time to be later than the start time of the following step?

In general, steps can be interleaved: a student can start working on one step, complete another step, and return to the first step. In rows 81 and 82 of algebra_2005_2006_train.txt, this explanation isn't a great fit because row 81 had no incorrect attempts, meaning that somehow the student "started" the FinalAnswer step, and then, 105 seconds later, completed it. But perhaps FinalAnswer is representing a class of "final answers", since there are 4 "Corrects" for that step. That would mean that a step called FinalAnswer was available for the student to solve correctly 4 times (which they did without error). See also our FAQ entry where we mention "equation solving problems".

Can you give us some suggestions about how to understand some step names? (E.g. "XR1" and "R7C1")

R7C1 is "row 7 column 1" and indicates where this student input appeared within the table that students are working within. We're not sure about XR1, but it may be better inferred from the other step names in the same problem.

Is there any manual of the tutor system? It will be very helpful for us to understand the system and data sets as we have never used this tutoring system before.

To learn more about the tutor, see some of the papers about the Algebra tutor and some about relevant units in the Bridge to Algebra tutor. These include the following:

Ritter, Steven; Haverty, Lisa; Koedinger, Kenneth; Hadley, William; Corbett, Albert (2008). Integrating intelligent software tutors with the math classroom. G. Blume and K. Heid (Eds.), Research on Technology and the Teaching and Learning of Mathematics: Vol. 2 Cases and Perspectives. Charlotte, NC: IAP. [PDF]

Koedinger, K. R. & Aleven, V. (2007). Exploring the assistance dilemma in experiments with Cognitive Tutors. Educational Psychology Review, 19 (3): 239-264. [PDF]

You can use forward and backward pointers to and from the references in these papers to find other potentially relevant papers - so can looking at related web sites like learnlab.org, carnegielearning.com, and pact.cs.cmu.edu/koedinger/koedingerCV.html.

How can I verify the integrity of the files I've downloaded?

You can use the Unix program sha1sum and the following SHA-1 hashes to verify the challenge files:

d6907b97e675248c86683098f1075696b5d1a17d *algebra_2008_2009.zip
7134e3b44af538a55d53f15463b4db50abb191c2 *bridge_to_algebra_2008_2009.zip

Another way to verify the files is to count the number of lines in the training and test files, add them together, and compare this to the number of steps reported on the downloads page. You can use a Unix command like wc -l <filename> to count the number of lines in each file. Note that the number of steps we report does not include the header rows, so your number is likely to be greater by 2 (one header row in each file).
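
If sha1sum and wc are not available, the same two checks can be reproduced with a short Python sketch (the hashes are the ones listed above):

  import hashlib

  EXPECTED = {
      "algebra_2008_2009.zip": "d6907b97e675248c86683098f1075696b5d1a17d",
      "bridge_to_algebra_2008_2009.zip": "7134e3b44af538a55d53f15463b4db50abb191c2",
  }

  for name, expected in EXPECTED.items():
      sha1 = hashlib.sha1()
      with open(name, "rb") as f:
          for chunk in iter(lambda: f.read(1 << 20), b""):
              sha1.update(chunk)
      print(name, "OK" if sha1.hexdigest() == expected else "MISMATCH")

  # Equivalent of `wc -l`; remember that the reported step count excludes
  # the two header rows (one in the training file, one in the test file).
  with open("algebra_2008_2009_train.txt", "rb") as f:
      print(sum(1 for _ in f), "lines in the training file")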

Why did you change the row numbering scheme for the challenge data sets?

This change was intentional. Our primary concern was that the numbering in the development sets left gaps between the student-units such that it was possible to infer how many more steps the student took to finish the unit by looking at the first row number of the next student-unit. This numbering scheme eliminates that possibility.

More specifically, it is the gap between the test set of Student-Unit N and the training set for Student-Unit N+1 that is the give-away clue. An example is shown below.

THE OLD NUMBERING SCHEME 
(the gap between 120 and 150 is the give-away clue):
Rows     Student   Unit  Data-set
1 - 100  Student1  Unit1 Training
101-120  Student1  Unit1 Test
150-250  Student1  Unit2 Training
251-265  Student1  Unit2 Test

THE NEW NUMBERING SCHEME:
Rows     Student   Unit  Data-set
1 - 100  Student1  Unit1 Training
1-20     Student1  Unit1 Test
101-201  Student1  Unit2 Training
21-35    Student1  Unit2 Test

THE DERIVABLE TEMPORAL ORDERING 
(but with no between-unit gap information):
Rows     Student   Unit  Data-set
1 - 100  Student1  Unit1 Training
101-120  Student1  Unit1 Test
121-221  Student1  Unit2 Training
222-236  Student1  Unit2 Test
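
The derivable ordering can be reconstructed mechanically: for each student and unit (in file order), take that unit's training rows followed by its test rows. Below is a hedged pandas sketch, assuming the unit can be read from the "Problem Hierarchy" column (in the real files the unit is embedded in that column and may need to be parsed out).

  import pandas as pd

  def derive_temporal_order(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
      """Interleave training and test rows per (student, unit), preserving file order."""
      train = train.assign(_source="train")
      test = test.assign(_source="test")
      blocks = []
      for (student, unit), train_block in train.groupby(
          ["Anon Student Id", "Problem Hierarchy"], sort=False
      ):
          test_block = test[
              (test["Anon Student Id"] == student)
              & (test["Problem Hierarchy"] == unit)
          ]
          blocks.append(pd.concat([train_block, test_block]))
      # Note: test rows whose student-unit never appears in training are dropped here.
      return pd.concat(blocks, ignore_index=True)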


Known Noise in the Data

In Algebra I 2006-2007, the knowledge components of some steps are ")]". Are they correct?

No, this is incorrect. In addition, there are other KC names that are cut off in similar ways. We will fix this, but since this issue is only present in the development sets, it is not a priority.

In Algebra I 2006-2007, some KCs have the substring "(null~~null)". For example:

[Rule: action after done expr (null~~null)]~~[Rule: 
[SolverOperation subtract] in expr ([SolverOperation subtract]

These names are not a problem: the two arguments here are the "expected input" for the rule and the actual user input. Checking the user input against the expected input is part of how the rule gets evaluated. In these cases, because the rule matches "any action", any input is "expected", so the expected input is null. In most of these cases the user input seems to be null as well, except for the "left", which indicates that the user specified the left-hand side of the equation.

You write that "a step row is uniquely identified by this hierarchy: Student > Problem Hierarchy > Problem Name > Problem View > Step." If that's true, then why do I see duplicates for this key?

It is true that there are duplicate (not unique) keys in the challenge data sets. (We haven't analyzed the development data sets, but they probably exist there too.) We'd classify this issue as noise in the data; there should not be duplicate keys, but there are. The following breakdown shows the extent of these duplicate keys in the training files of the two challenge data sets:

Algebra I 2008-2009 training: 8,918,055 keys; 8,915,724 unique keys; 2,331 duplicate keys (0.0261% of total)
Bridge to Algebra 2008-2009 training: 20,012,499 keys; 20,012,362 unique keys; 137 duplicate keys (0.0007% of total)

The most times a key is duplicated is 6 times.
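
These counts can be reproduced (approximately; tab-separated files and the column names from the data format section are assumed) with a few lines of pandas:

  import pandas as pd

  KEY = ["Anon Student Id", "Problem Hierarchy", "Problem Name", "Problem View", "Step Name"]

  train = pd.read_csv("bridge_to_algebra_2008_2009_train.txt", sep="\t", usecols=KEY)

  total = len(train)
  unique = len(train.drop_duplicates(KEY))
  print(f"{total} keys, {unique} unique keys, {total - unique} duplicate keys "
        f"({100 * (total - unique) / total:.4f}% of total)")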

In the test data, we found some rows with Problem View values greater than 1. Since Problem View is "the total number of times the student encountered the problem so far" and each test row is "determined by a program that randomly selects one problem for each student within a unit", how could Problem View be larger than 1?

When we created the test files, we didn't take "problem view" into account, so the result is that you may see more than one problem instance for a single unit-student pair in the test file. This was unintended, but you'll need to account for it in your prediction algorithm.

The presence of more than one problem for a student-unit pair in the test file will always take the form of all steps completed within another instance of that same problem. You can use the order of the rows in the test file and the problem view values to determine relatively when the student worked on the same problem again. Note that at the boundary between problem views - at the point where problem view increases - the steps surrounding the boundary are not necessarily contiguous: a student could have worked on the same problem much later in time.

In algebra_2008_2009_train.txt, I find that almost every line has the same start and end time. The step duration is also equal to 0. Is that normal?

Unfortunately, this is the case for all challenge data set training files. It is an issue in the raw data, and as such, we will not be able to fix it. It's not known whether the timestamp that is duplicated across all four time columns is the step start time, step end time, or something else. As we will not be able to correct this issue in the data, you will need to take it into account.

We've seen duplicate knowledge components assigned to one step. For example, in algebra_2008_2009_train.txt, row 2569, the KC is defined as 
LABEL-X-HELP~~LABEL-X-HELP and the opportunity numbers for those KCs as 62~~1. Shouldn't the opportunity count just increase over time?

This is a type of noise in the data. We don't know why it occurs. You should ignore the second KC that is listed, as well as its opportunity number. You are right that for a student, the KC opportunity count should increase as time goes on and the student encounters steps with that KC.

Can we say, for the same student, the records along the row ID are listed along a time index, i.e., for the same student, a record with a larger row ID happens after a record with a smaller row ID?

No, that's not a safe assumption for a couple of reasons. First, a clarification about the format: a student-step record is a summary of one or more student attempts ("transactions") on a step, each with their own time stamp. This means that one can't refer to a row as happening at one time. Secondly, attempts at different steps can be interleaved. Therefore, there is no guarantee that a student finishes one step before starting another one.

We can clarify the order of student-step rows in either a training or test file. The order of these student-step rows is determined by the following rule: for a given instance of student working on a problem, order the rows by a field called "step time", which is the time of the first correct transaction ("Correct Transaction Time", where given) or, if there is no correct transaction, the time of the last transaction on the step ("Step End Time"). We don't display this "step time" value but we use it to order rows.

If the data is noisy, however, such as in Algebra 2008-2009 where the same time is given across all time fields for the step AND all steps in the problem, then the sorting within a problem is indeterminate: we can't say what order it is in since we've tried to sort rows based on identical criteria. Similarly, ordering problems for a student could also be random if the same step time is used for more than one problem.
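
A minimal sketch of that sort key on a training file (the test files omit the time columns; the file name is an example, and in the noisy 2008-2009 data the resulting order is only as good as the timestamps):

  import pandas as pd

  train = pd.read_csv("algebra_2005_2006_train.txt", sep="\t")

  # 'Step time': the correct transaction time where present, else the step end time.
  train["_step_time"] = pd.to_datetime(
      train["Correct Transaction Time"].fillna(train["Step End Time"])
  )

  # Re-derive the within-problem ordering; the stable sort keeps rows with
  # identical step times in their original file order. Note that this groups
  # a student's problems by name, not by when they were actually worked.
  train = train.sort_values(
      ["Anon Student Id", "Problem Name", "Problem View", "_step_time"],
      kind="stable",
  )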

Contacts

KDD Cup 2010 Chairs
  • John Stamper
    PSLC DataShop Technical Director, Carnegie Mellon University
  • Alexandru Niculescu-Mizil
    Herman Goldstine Postdoctoral Fellow at IBM T.J. Watson Research Lab

Please visit the original KDD Cup 2010 website for more information.

KDD Cup 2009: Customer relationship prediction

This Year's Challenge

Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

The most practical way to build knowledge about customers in a CRM system is to produce scores. A score (the output of a model) is an evaluation, for every instance, of a target variable to explain (i.e., churn, appetency, or up-selling). Tools that produce scores make it possible to project quantifiable information onto a given population. The score is computed using input variables that describe instances. Scores are then used by the information system (IS), for example, to personalize the customer relationship. Orange Labs has developed an industrial customer analysis platform able to build prediction models with a very large number of input variables. This platform implements several processing methods for instance and variable selection, prediction, and indexation, based on an efficient model combined with variable-selection regularization and a model-averaging method. The main characteristic of this platform is its ability to scale to very large datasets with hundreds of thousands of instances and thousands of variables. The rapid and robust detection of the variables that contribute most to the output prediction can be a key factor in a marketing application.

The challenge is to beat the in-house system developed by Orange Labs. It is an opportunity to prove that you can deal with a very large database, including heterogeneous noisy data (numerical and categorical variables), and unbalanced class distributions. Time efficiency is often a crucial point. Therefore part of the competition will be time-constrained to test the ability of the participants to deliver solutions quickly.

Task Description

The task is to estimate the churn, appetency and up-selling probability of customers, hence there are three target values to be predicted. The challenge is staged in phases to test the rapidity with which each team is able to produce results. A large number of variables (15,000) is made available for prediction. However, to engage participants having access to less computing power, a smaller version of the dataset with only 230 variables will be made available in the second part of the challenge.

  • Churn (wikipedia definition): Churn rate is also sometimes called attrition rate. It is one of two primary factors that determine the steady-state level of customers a business will support. In its broadest sense, churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time. The term is used in many contexts, but is most widely applied in business with respect to a contractual customer base. For instance, it is an important factor for any business with a subscriber-based service model, including mobile telephone networks and pay TV operators. The term is also used to refer to participant turnover in peer-to-peer networks.
  • Appetency: In our context, the appetency is the propensity to buy a service or a product.
  • Up-selling (wikipedia definition): Up-selling is a sales technique whereby a salesman attempts to have the customer purchase more expensive items, upgrades, or other add-ons in an attempt to make a more profitable sale. Up-selling usually involves marketing more profitable services or products, but up-selling can also be simply exposing the customer to other options he or she may not have considered previously. Up-selling can imply selling something additional, or selling something that is more profitable or otherwise preferable for the seller instead of the original sale.

Evaluation

The performances are evaluated according to the arithmetic mean of the AUC for the three tasks (churn, appetency, and up-selling). This is what we call "Score" on the Results page.

Sensitivity and specificity

The main objective of the challenge is to make good predictions of the target variables. The prediction of each target variable is thought of as a separate classification problem. The results of classification, obtained by thresholding the prediction score, may be represented in a confusion matrix, where tp (true positive), fn (false negative), tn (true negative) and fp (false positive) represent the number of examples falling into each possible outcome:

                     Prediction
                     Class +1    Class -1
Truth    Class +1    tp          fn
         Class -1    fp          tn

Any sort of numeric prediction score is allowed, larger numerical values indicating higher confidence in positive class membership.

We define the sensitivity (also called true positive rate or hit rate) and the specificity (true negative rate) as:

  • Sensitivity = tp/pos
  • Specificity = tn/neg

where pos = tp+fn is the total number of positive examples and neg = tn+fp is the total number of negative examples.

AUC

The results will be evaluated with the so-called Area Under Curve (AUC). It corresponds to the area under the curve obtained by plotting sensitivity against specificity while varying the threshold on the prediction values that determines the classification result. The AUC is related to the area under the lift curve and the Gini index used in marketing (Gini = 2 AUC - 1). The AUC is calculated using the trapezoid method. In the case when binary scores are supplied for the classification instead of discriminant values, the curve is given by {(0,1), (tn/(tn+fp), tp/(tp+fn)), (1,0)} and the AUC is just the Balanced Accuracy (BAC).
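
A minimal scoring sketch, assuming scikit-learn is available (its ROC-based AUC equals the area under the sensitivity-versus-specificity curve described above, since reflecting the x-axis preserves the area):

  import numpy as np
  from sklearn.metrics import roc_auc_score

  def kdd2009_score(y_true: dict, y_score: dict) -> float:
      """Arithmetic mean of the AUC over the three real tasks.

      y_true[task] holds the +1/-1 labels; y_score[task] holds prediction scores,
      with larger values meaning more confidence in the positive class.
      """
      tasks = ("churn", "appetency", "upselling")
      return float(np.mean([roc_auc_score(y_true[t], y_score[t]) for t in tasks]))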

Competition Rules

Conditions of participation: Anybody who complies with the rules of the challenge (KDDcup 2009) is welcome to participate. Only the organizers are excluded from participating. The KDDcup 2009 is part of the competition program of the Knowledge Discovery in Databases conference (KDD 2009), Paris June 28-July 1st, 2009. Participants are not required to attend the KDDcup 2009 workshop, which will be held at the conference, and the workshop is open to anyone who registers. The proceedings of the competition will be published by the Journal of Machine Learning Research Workshop and Conference Proceedings (JMLR WC&P).

Anonymity: All entrants must identify themselves by registering on the KDDcup 2009 website. However, they may elect to remain anonymous by choosing a nickname and checking the box "Make my profile anonymous". If this box is checked, only the nickname will appear in the result tables instead of the real name. Participant emails will not appear anywhere on the website and will be used only by the organizers to communicate with the participants. To be eligible for prizes the participants will have to publicly reveal their identity and uncheck the box "Make my profile anonymous".

Data: The dataset is available for download from the Data page to registered participants. The data are available in several archives to facilitate downloading, and two versions are made available ("small" with 230 variables, and "large" with 15,000 variables). The participants may enter results on either or both versions, which correspond to the same data entries, the 230 variables of the small version being just a subset of the 15,000 variables of the large version. Both training and test data are available without the true target labels. For practice purposes, "toy" training labels are available together with the training data from the outset of the challenge in the fast track. The results on toy targets (T) will not count for the final evaluation. The real training labels of the tasks "churn" (C), "appetency" (A), and "up-selling" (U) will be made available for download separately half-way through the challenge.

Challenge duration and tracks: The challenge starts March 10, 2009 and ends May 11, 2009. There are two challenge tracks:

  • FAST (large) challenge: Results submitted on the LARGE dataset within five days of the release of the real training labels will count towards the fast challenge.
  • SLOW challenge: Results on the small dataset and results on the large dataset not qualifying for the fast challenge, submitted before the KDDcup 2009 deadline May 11, 2009, will count toward the SLOW challenge.

If more than one submission is made in either track and with either dataset, the last submission before the track deadline will be taken into account to determine the ranking of participants and attribute the prizes. You may compete in both tracks. There are prizes in both tracks.

On-line feed-back: During the challenge, the training set performances will be available on the Results page as well as partial information on test set performances: The test set performances on the toy task (T) and performances on a fixed 10% subset of the test examples for the real tasks (C, A, U). After the challenge is over, the performances on the whole test set will be calculated and substituted in the result tables.

Submission method: The method of submission is via the form on the Submission page. To be ranked, submissions must comply with the Instructions. A submission should include results on both training and test set on at least one of the tasks (T, C, A, U), but it may include results on several tasks. A submission will be considered "complete" and eligible for prizes if it contains 6 files corresponding to training and test data predictions for the tasks C, A, and U, either for the small or for the large dataset (or for both). Results on the practice task T will not count as part of the competition. If you encounter problems with the submission process, please contact the Challenge Webmaster. Multiple submissions are allowed, but please limit yourself to 5 submissions per day maximum. For your final entry in the slow track, you may submit results on either or both small and large datasets in the same archive (hence you get 2 chances of winning).

Evaluation and ranking: For each entrant, only the last valid entry will count towards determining the winner in each track (fast and slow). We limit each participating person to a single final entry in each track (see the FAQs page for the conditions under which you can work in teams). Valid entries must include results on all three real tasks. The method of scoring is posted on the Tasks page. Prizes will be attributed only to entries performing better than the baseline method (Naive Bayes). The results of the baseline method are provided on the Results page. These are not the best results obtained by the organizing team at Orange; they are easy to outperform, but difficult to attain by chance.

Reproducibility: Participation is not conditioned on delivering code or publishing methods. However, we will ask the top-ranking participants to voluntarily fill out a fact sheet about their methods, contribute papers to the proceedings, and help reproduce their results.

Data Download

Training and test data matrices and practice target values

The large dataset archives have been available since the start of the challenge. The small dataset will be made available at the end of the fast challenge. Both training and test sets contain 50,000 examples. The data are split in the same way for the small and large versions, but the examples are ordered differently within the training and test sets. Both small and large datasets have numerical and categorical variables. For the large dataset, the first 14,740 variables are numerical and the last 260 are categorical. For the small dataset, the first 190 variables are numerical and the last 40 are categorical. Toy target values are available only for practice purposes. The prediction of the toy target values will not be part of the final evaluation.

Small version (230 var.):

Large version (15,000 var.):

Toy targets (large):

True task labels

Real binary targets (small):

Real binary targets (large):


Data Format

The datasets use a format similar to the text export format of relational databases:

  • One header line with the variable names
  • One line per instance
  • Values separated by tabs
  • Missing values indicated by consecutive tabs

The large data matrix is obtained by appending the downloaded chunks in order of their chunk number. The header line is present only in the first chunk.

The target values (.labels files) have one example per line in the same order as the corresponding data files. Note that churn, appetency, and up-selling are three separate binary classification problems. The target values are +1 or -1. We refer to examples having +1 (resp. -1) target values as positive (resp. negative) examples.

The Matlab matrices are numeric. When loaded, the data matrix is called X. The categorical variables are mapped to integers. Missing values are replaced by NaN for the original numeric variables while they are mapped to 0 for categorical variables.
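
For illustration, here is a minimal Python loading sketch that follows the conventions above (tab-separated values, consecutive tabs for missing values, the trailing columns categorical). The file names, and the assumption that the downloaded chunks have already been concatenated into one file, are hypothetical and not part of the official distribution.

  # Minimal loading sketch (hypothetical file names). Assumes the downloaded
  # chunks were concatenated in chunk order, so the header appears only once.
  import pandas as pd

  N_NUMERICAL = 14740  # large dataset: first 14,740 variables are numerical

  data = pd.read_csv("orange_large_train.data", sep="\t")  # empty fields become NaN
  numerical = data.iloc[:, :N_NUMERICAL].apply(pd.to_numeric, errors="coerce")
  categorical = data.iloc[:, N_NUMERICAL:].astype("category")

  # Labels: one +1/-1 value per line, in the same order as the data file.
  churn = pd.read_csv("orange_large_train_churn.labels", header=None).squeeze("columns")
  print(numerical.shape, categorical.shape, churn.value_counts())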

Results

Winners of KDD Cup 2009: Fast Track
  • First Place: IBM Research
    Ensemble Selection for the KDD Cup Orange Challenge
  • First Runner Up: ID Analytics, Inc
    KDD Cup Fast Scoring on a Large Database
  • Second Runner Up: Old dogs with new tricks (David Slate, Peter W. Frey)

Winners of KDD Cup 2009: Slow Track
  • First Place: University of Melbourne
    University of Melbourne entry
  • First Runner Up: Financial Engineering Group, Inc. Japan
    Stochastic Gradient Boosting
  • Second Runner Up: National Taiwan University, Computer Science and Information Engineering
    Fast Scoring on a Large Database using regularized maximum entropy model,
    categorical/numerical balanced AdaBoost and selective Naive Bayes

Full Results: Fast Track
Rank  Team Name  Method  AUC (Churn)  AUC (Appetency)  AUC (Upselling)  Score
1 IBM Research Final Submission 0.7611 0.8830 0.9038 0.8493
2 ID Analytics, Inc DT 0.7565 0.8724 0.9056 0.8448
3 Old dogs with new tricks Our own method 0.7541 0.8740 0.9050 0.8443
4 Crusaders Joint Score Technique 0.7569 0.8688 0.9034 0.8430
5 Financial Engineering Group, Inc. Japan boosting 0.7498 0.8732 0.9057 0.8429
6 LatentView Analytics Boosting 0.7579 0.8670 0.9034 0.8428
7 Data Mining Logistic 0.7580 0.8659 0.9034 0.8424
8 StatConsulting (K.Ciesielski, M.Sapinski, M.Tafil) AdvancedMiner 0.7544 0.8723 0.8997 0.8421
9 Sigma Decision Tree Algo 0.7568 0.8644 0.9034 0.8415
10 Analytics CART 0.7564 0.8644 0.9034 0.8414
11 Ming Li & Yuwei Zhang me 0.7507 0.8683 0.9050 0.8413
12 Hungarian Academy of Sciences fri4 0.7496 0.8683 0.9042 0.8407
13 Oldham Athletic Reserves tiberius10 0.7492 0.8699 0.9026 0.8406
14 Swetha Logistic 0.7550 0.8659 0.8996 0.8401
15 VladN vnf8c 0.7415 0.8692 0.9012 0.8373
16 VADIS Bagging 0.7474 0.8631 0.8994 0.8366
17 brendano random forests (res11) 0.7468 0.8627 0.9003 0.8366
18 commendo 1 before noon 0.7381 0.8693 0.8988 0.8354
19 FEG CTeam Boosting 0.7389 0.8616 0.9011 0.8338
20 Vadis Team 2 Best final 0.7442 0.8568 0.8996 0.8335
21 National Taiwan University, Computer Science and Information Engineering all2 0.7428 0.8679 0.8890 0.8332
22 Kranf TIM 0.7463 0.8478 0.8980 0.8307
23 Neo Metrics final2 0.7454 0.8449 0.8994 0.8299
24 ooo 10-3 0.7427 0.8520 0.8920 0.8289
25 TonyM mymethod5 0.7397 0.8481 0.8988 0.8289
26 AIIALAB ensemble 0.7413 0.8458 0.8969 0.8280
27 Uni Melb hfinal 0.7087 0.8669 0.8996 0.8251
28 Christian Colot My GoldMiner 0.7183 0.8577 0.8958 0.8240
29 Céline Theeuws final 0.7346 0.8476 0.8835 0.8219
30 m&m final test 0.7218 0.8423 0.8924 0.8189
31 Predictive Analytics Logistic 0.7131 0.8336 0.8917 0.8128
32 DKW NN / Logistic Regression on Laptop 0.6980 0.8449 0.8928 0.8119
33 NICAL Dys 0.7108 0.8461 0.8707 0.8092
34 UW eq+uneq 0.6804 0.8531 0.8815 0.8050
35 Prem Swaroop thmdkd4 0.6972 0.8384 0.8794 0.8050
36 Dr. Bunsen Honeydew submission #004 0.7048 0.8235 0.8760 0.8015
37 dodio L2 0.7179 0.8474 0.8356 0.8003
38 FEG D TEAM mix2 0.6997 0.8139 0.8824 0.7987
39 minos rdf 0.6828 0.8233 0.8698 0.7920
40 M Release1 0.7289 0.8341 0.8053 0.7894
41 dataminers Ensemble Model 2 0.6850 0.8288 0.8205 0.7781
42 Weka1 final 0.6795 0.7727 0.8764 0.7762
43 idg b_1 0.6851 0.7931 0.8458 0.7747
44 HP Labs - Analytics Research lrs 0.6414 0.8042 0.8607 0.7687
45 hyperthinker L1-regularization 0.6770 0.7822 0.8386 0.7659
46 vodafone b 0.6819 0.7216 0.8917 0.7651
47 paberlo Method10 0.6717 0.7544 0.8451 0.7571
48 Lenca test 0.6713 0.7493 0.8456 0.7554
49 C.A.Wang Bagging 0.5956 0.8300 0.8369 0.7541
50 FEG B Naive Bayes & Logit 0.6499 0.8317 0.7777 0.7531
51 rw lst 0.6368 0.7045 0.8070 0.7161
52 Tree Builders VA - NN 0.6358 0.6583 0.7918 0.6953
53 Leo Naive Coding 0.5928 0.5544 0.7314 0.6262
54 ZhiGao Zhigao5 0.5425 0.5431 0.5774 0.5543
55 homehome etc 0.5835 0.3876 0.6290 0.5334
56 decaff zzz 0.5288 0.5009 0.5608 0.5302
57 Claminer only churn 0.5731 0.5095 0.5055 0.5294
58 Klimma simple 0.5034 0.5025 0.4965 0.5008
59 Reference Random predictions 0.5030 0.4889 0.5069 0.4996
End of Table
Full Results: Slow Track
 
Rank  Team Name  Method  AUC (Churn)  AUC (Appetency)  AUC (Upselling)  Score
1 IBM Research Submission 0.7651 0.8819 0.9092 0.8521
2 Uni Melb The generally satisfactory model 0.7570 0.8836 0.9048 0.8484
3 ID Analytics, Inc DT with bagging (large + small) 0.7614 0.8761 0.9061 0.8479
4 Financial Engineering Group, Inc. Japan boosting 0.7589 0.8768 0.9074 0.8477
5 National Taiwan University, Computer Science and Information Engineering all_final 0.7558 0.8789 0.9036 0.8461
6 Hungarian Academy of Sciences last_after_last 0.7567 0.8736 0.9065 0.8456
7 Neo Metrics FINAL 0.7521 0.8756 0.9059 0.8445
8 Ming Li & Yuwei Zhang me 0.7512 0.8744 0.9059 0.8439
9 Data Mining Multiple Techniques 0.7574 0.8700 0.9036 0.8437
10 Tree Builders Sub2 Better Variables 0.7552 0.8736 0.9021 0.8436
11 dataminers Combined Model Submission 1 0.7553 0.8736 0.9016 0.8435
12 Oldham Athletic Reserves Tiberius Data Mining Algorithms 0.7525 0.8720 0.9028 0.8424
13 Swetha Logistic 0.7580 0.8652 0.9038 0.8423
14 Analytics Decision Tree Algo 0.7559 0.8691 0.9014 0.8421
15 Old dogs with new tricks Our own method 0.7488 0.8730 0.9040 0.8419
16 Sigma Enemble classifier 0.7580 0.8636 0.9041 0.8419
17 Weka1 finalRules 0.7477 0.8727 0.9049 0.8418
18 Predictive Analytics Adaptive Boosting Algorithm 0.7579 0.8676 0.8995 0.8416
19 LatentView Analytics Segmented Joint Score 0.7578 0.8675 0.8995 0.8416
20 Crusaders CART + Combination Logic 0.7579 0.8660 0.8995 0.8411
21 HP Labs - Analytics Research adds 0.7500 0.8653 0.9049 0.8401
22 VladN vn_large 0.7484 0.8597 0.9040 0.8373
23 brendano random forests (res11) 0.7468 0.8627 0.9003 0.8366
24 VADIS Final Slow 0.7442 0.8631 0.9013 0.8362
25 AIIALAB merge 0.7467 0.8551 0.8963 0.8327
26 ooo 3 0.7434 0.8583 0.8936 0.8318
27 m&m final test 0.7434 0.8486 0.8984 0.8301
28 Vadis Team 2 test1 0.7381 0.8493 0.9013 0.8296
29 commendo a2 bag 10 0.7321 0.8637 0.8920 0.8293
30 Kranf The Intelligent Mining Machine(TIM) 0.7369 0.8434 0.8963 0.8255
31 Christian Colot My GoldMiner 0.7183 0.8577 0.8958 0.8240
32 UW Final-2 0.7171 0.8455 0.8927 0.8184
33 AI Kahu 0.7255 0.8408 0.8872 0.8179
34 NICAL Dys 0.7108 0.8461 0.8707 0.8092
35 creon boosting combinations 0.7359 0.8268 0.8615 0.8081
36 LosKallos bit regurb 0.7398 0.8204 0.8621 0.8074
37 FEG_BOSS Boosting 0.7406 0.8149 0.8621 0.8058
38 Lenca RAE 0.7348 0.8175 0.8629 0.8051
39 Prem Swaroop thmdkd4 0.6972 0.8384 0.8794 0.8050
40 M final 0.7319 0.8153 0.8644 0.8038
41 FEG ATeam logit 0.7325 0.8160 0.8610 0.8031
42 pavel combfinal 0.7358 0.8130 0.8591 0.8027
43 Additive Groves Additive Groves 0.7135 0.8311 0.8605 0.8017
44 nikhop bteqwcomb 0.7359 0.8098 0.8589 0.8015
45 mi a 0.7365 0.8090 0.8569 0.8008
46 java. lang. OutOfMemory Error weka2 0.7360 0.8090 0.8572 0.8007
47 dodio L2 0.7179 0.8474 0.8356 0.8003
48 Lajkonik final submission 0.7323 0.8073 0.8600 0.7999
49 FEG CTeam test2 0.7321 0.8062 0.8596 0.7993
50 Céline Theeuws test 7 0.7230 0.8147 0.8584 0.7987
51 zlm dt 0.7232 0.8175 0.8544 0.7983
52 FEG B submit8 0.7354 0.8031 0.8544 0.7976
53 CSN cs_nott4 0.7282 0.8051 0.8594 0.7976
54 TonyM final 0.7249 0.7996 0.8596 0.7947
55 Sundance BT 0.7244 0.8172 0.8400 0.7939
56 minos rdf 0.6828 0.8233 0.8698 0.7920
57 Miner12 Model56 0.7264 0.7973 0.8484 0.7907
58 FEG D TEAM logit+tree 0.7219 0.8077 0.8422 0.7906
59 DKW Logistic Regression with interactions 0.7153 0.8015 0.8547 0.7905
60 idg disc3 0.7146 0.8041 0.8507 0.7898
61 homehome test572 0.7176 0.8062 0.8416 0.7885
62 Mai Dang Boosting GS10+NN+KR 0.7167 0.8099 0.8372 0.7880
63 parramining. blogspot. com Basic play 0.7134 0.8056 0.8438 0.7876
64 bob thirdM 0.7053 0.8052 0.8520 0.7875
65 muckl final2 0.7239 0.8195 0.8180 0.7871
66 C.A.Wang Bagging 0.7067 0.8043 0.8502 0.7871
67 KDD@PT c_a_u_ 0.7081 0.7989 0.8528 0.7866
68 decaff zzz 0.7120 0.7916 0.8498 0.7845
69 StatConsulting (K.Ciesielski, M.Sapinski, M.Tafil) Original232Vars 0.7137 0.7605 0.8501 0.7748
70 Dr. Bunsen Honeydew final submission 0.7170 0.8052 0.7954 0.7725
71 Raymond Falk Orange_ Small_ Results_ KDD2009_ OptInference 0.6905 0.7744 0.8465 0.7705
72 vodafone smallstm3.3 0.7258 0.6915 0.8582 0.7585
73 paberlo Method10 0.6717 0.7544 0.8451 0.7571
74 K2 Btest 0.7078 0.7670 0.7931 0.7560
75 Claminer class1 0.6665 0.7785 0.8199 0.7550
76 rw t 0.7257 0.6928 0.8369 0.7518
77 Leo Naive Coding 0.6528 0.7504 0.7693 0.7242
78 Persistent Hello_Theta 0.6416 0.7167 0.7370 0.6984
79 Louis Duclos-Gosselin Personal Algorithm 0.6168 0.7571 0.6792 0.6843
80 Chy LDA 0.6027 0.7201 0.6936 0.6721
81 Abo-Ali Weka 0.6249 0.6425 0.7218 0.6630
82 sduzx Straft 0.6057 0.6465 0.6167 0.6230
83 MT finnaly 0.5494 0.5378 0.6873 0.5915
84 Shiraz University - Undergradute Team lazy 0.5077 0.5047 0.7000 0.5708
85 ZhiGao Zhigao5 0.5425 0.5431 0.5774 0.5543
86 Klimma tree 0.5283 0.5231 0.5909 0.5474
87 hyperthinker knn 0.5000 0.5090 0.5000 0.5030
88 Reference Random predictions 0.4997 0.5057 0.5025 0.5026
89 thes bz 0.5016 0.4993 0.4982 0.4997
End of Table

FAQs

Participation and Registration

What is the goal of the challenge?

The challenge consists of several classification problems. The goal is to make the best possible predictions of a binary target variable from a number of predictive variables.

Can I enter under multiple names?

No, we limit each participant to one final entry, which may contain results on the large dataset only in the fast track, and on either or both the small and the large dataset in the slow track. Registering under multiple names would be considered cheating and would disqualify you. Your real identity must be known to the organizers. You may hide your identity from the public only by checking the "Make my profile anonymous" box in the registration form.

Can I participate to multiple teams?

No. Each individual is allowed to make only a single final entry into the challenge to compete for the prizes. During the development period, each team must have a different registered team leader. To be ranked in the challenge and qualify for prizes, each registered participant (individual or team leader) will have to disclose the names of any team members before the final results of the challenge are released. Hence, at the end of the challenge, you will have to choose which team you want to belong to (only one!) before the results are publicly released. After the results are released, no change in team composition will be allowed.

I understand that one person can join only one team, however, is it ok to have many teams in the same organization?

Yes, that is OK. Each team leader must be a different person, each must register, and the teams cannot overlap. Before the end of the challenge, the team leaders will have to declare the composition of their team. This will have to correspond to the list of co-authors in the proceedings, if they decide to publish their results. Hence a professor cannot have his/her name on all of his/her students' papers (but can be thanked in the acknowledgements).

How do I register a team?

Only register the team leader and choose a nickname for your team. We'll let you know later how to disclose the members of your team.

Can the organizers enter the challenge?

No. The organizers may make entries under the common name "Reference" to stimulate the competition, but they do not compete towards the prizes.


Data: Download, format, etc.

I have problems with the ZIP files which appear to be corrupted. Can I get a DVD?

Try to download one archive at a time. If the problem persists, contact the organizers so they can send you a DVD.

Are the data available in other formats: matlab, SAS, etc.?

There are several Matlab versions posted on the Forum. There is also a numerical version of the categorical variables in text format for the large dataset. Please post your own version of the data to share it with others.

Is there sample code available?

Yes. We made available sample Matlab code to help you format your results. There are also examples to call CLOP models from that code. AT THIS STAGE THERE IS NOT YET MATLAB SUPPORT FOR HANDLING THE LARGE DATASET.

Are the true targets distributed similarly as the toy target?

No. The toy target is generated by an artificial stochastic process. The class proportions are different for the real targets, which have fewer than 10% of examples in the positive class.

I have observed that the last columns (after variable 14740) are not numerical, are the data corrupted?

The last variables are categorical variables. The strings correspond to category codes. This could be for instance a city name. But for reasons of privacy, the real names were replaced by strings that are meaningless.

I have observed that some columns are empty or constant, are the data corrupted?

No. This is correct and is part of the challenge, which deals with automatic data preparation and modeling on real industrial data. Filtering out constant variables is the easy part of the challenge.

I have observed that the first chunk of the large dataset contains only 9999 lines, is this correct?

Yes. Chunk 1 contains 9999 data lines plus the header. All other chunks have no header. The last chunk has 10001 lines. So the total is 50000 lines of data.

In the categorical variables, do the values need to be handled as meaningful sequences or are they just codes?

The original categorical values were symbols, not indicating any category ordering. The category symbols have been replaced by random anonymized values (strings) with no semantics, in a one-to-one mapping with the original values, so as to preserve the structure of the data.

Do the targets correspond to single or multiple products?

The targets correspond to single products (but not necessarily the same one). For instance, churn concerns mobile phone customers switching providers, and up-selling concerns upgrading the plan to include television.

Is there a meaning in the variable ordering?

No. The variables are randomly ordered.

Are the variables in the small dataset a subset of those in the large dataset?

Yes. However, they are disguised to make them non-trivial to identify and to discourage people from doing so. The examples are also ordered differently to make such a mapping even harder. We would prefer that participants work on each dataset separately, although they may work on both.

Are the training and test data drawn from the same distribution?

Yes.

Are the set of categorical variable values the same in the training and test data?

Not necessarily. Some values might show up only in training data or only in test data.

Are there the same number of values in each line?

There can be missing values. The values are separated by tabulations. Two consecutive tabs indicate a missing value.

Is it allowed to unscramble the small dataset?

Scrambling was done to encourage participants to work separately on the small and large datasets. If we had wanted participants to be able to use the features of the small dataset in addition to those they might select from the large one, we would not have scrambled the data. We realize, however, that if we forbade participants from unscrambling and considered it cheating, we would have difficulty enforcing that rule. Hence, participants who unscramble the small dataset will not be disqualified from the competition. All participants will be asked to report at the end of the challenge whether they made use of unscrambling and whether they derived any advantage from it.


Evaluation: Tracks, submission format, etc.

Why do we need to submit results on training data?

In this way we can assess the robustness of the models. If you make great predictions on the training data but perform poorly on the test data, your method is likely overfitting.

What is the purpose of giving performances on 10% of the test data?

We want to give feed-back to the participants to motivate them. In this way, they can see roughly how their performance compares to others. But, by giving feed-back on only 10% of the data, we prevent them from fine-tuning their systems on the test data (i.e., de facto "learning" from the test data). There will be a slight bias in performance because of the 10% on which feed-back is provided, but it is the same bias for all contestants.

Is it correct that even if I submit the result on the large dataset in the fast track, I can submit the result on the large dataset in the slow track together with that on the small dataset?

Yes. In fact, you may submit as many times as you want, but only the last complete entry (with churn, appetency, and up-selling results on both training and test data = 6 files) will count in each track, depending on the submission date. In the fast track, you may enter only large dataset results, so you get 1 chance. In the slow track, you may enter results on both the small and large datasets, so you get 2 chances (the better of your 2 results will be taken into account). In total, you get 3 chances of winning.

If I submit results on both the small and large datasets in the slow track, how will results be evaluated?

The best of your 2 results will be taken into account.

Both small and large entries compete for the slow prize, but they seem to correspond to two distinct problems? Shouldn't there be two slow track prizes?

The small dataset is a downsized version of the large one: the same examples, with a subset of the features. To distinguish the two, the examples were ordered differently and the features were coded differently, in a way that should not affect performance but makes the mapping non-obvious to descramble. Because of the (unlikely) possibility that someone would spend time descrambling, we decided to give a single prize in the slow challenge, so as not to encourage people to cheat.

If I submit results before the fast track deadline, will those results also enter the slow track if I submit nothing afterwards?

Yes. For each deadline, your last valid complete entry will be entered in the ranking. So if you submit only to the fast track, your results will automatically be entered in the slow track.

If I win in both tracks, will I cumulate prizes?

No. You will get the largest of the two prizes. The remaining money will be used to give travel grants to other deserving participants.

On the result page, there is a "Score" column in the table, what does it mean?

As explained on the Tasks page, the score is the arithmetic mean of the AUC for the three tasks (churn, appetency, and up-selling).
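
For illustration only (this is not the organizers' evaluation code), the score of an entry could be computed along the following lines, given the true +1/-1 labels and the predicted confidence scores for the three tasks; the arrays below are random placeholders.

  # Illustrative scoring sketch: the score is the arithmetic mean of the per-task AUCs.
  import numpy as np
  from sklearn.metrics import roc_auc_score

  def kdd09_score(labels, predictions):
      """labels/predictions: dicts mapping task name to arrays of +1/-1 labels and scores."""
      aucs = {t: roc_auc_score(labels[t], predictions[t])
              for t in ("churn", "appetency", "upselling")}
      return aucs, float(np.mean(list(aucs.values())))

  # Random placeholder data, just to show the call:
  rng = np.random.default_rng(0)
  y = {t: rng.choice([-1, 1], size=1000, p=[0.93, 0.07]) for t in ("churn", "appetency", "upselling")}
  p = {t: rng.random(1000) for t in y}
  print(kdd09_score(y, p))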

I see a bunch of xxxx instead of my score, is there a problem?

No. Until the true labels of the challenge tasks are released, participants who submit results on those tasks cannot see them; this prevents anyone from gaining information by guessing. Only results on the toy problem are shown. You may still practice submitting some random values to test the system, but you will not see the results.


DISCLAIMER

Can a participant give an arbitrary hard time to the organizers?

ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". ORANGE AND/OR OTHER ORGANIZERS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY RIGHTS. IN NO EVENT SHALL ORANGE AND/OR OTHER ORGANIZERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE CHALLENGE.

Contacts

KDD Cup 2009 Chairs
  • David Vogel
    Data Mining Solutions
  • Isabelle Guyon
    Clopinet

Please visit the original KDD Cup 2009 website for more information.

KDD Cup 2008: Breast cancer

This Year's Challenge

The KDD Cup 2008 challenge focuses on the problem of early detection of breast cancer from X-ray images of the breast. In a screening population, a small fraction of cancerous patients have more than one malignant lesion. To simplify the problem, we only consider one type of cancer - cancerous masses - and only include cancer patients with at most one cancerous mass per patient. The challenge will consist of two parts, each of which is related to the development of algorithms for Computer Aided Detection (CAD) of early stage breast cancer from X-ray images.

Task Description

We propose to conduct two different yet closely related challenges based on this data. On the test data, the participants have to return two different files, one corresponding to each challenge.

1. The rate of prevalence of malignant patients in a screening environment is extremely low (on average only around 5-10 patients out of 1000 screening patients have breast cancer). Therefore, in the first challenge, the participating entries will be judged in terms of the area under the FROC curve in the clinically relevant region of 0.2-0.3 false positives per image. To support this, the participants have to return a file with a confidence score for every candidate in the test set (from -infinity to +infinity) that indicates the confidence of their classifier that the candidate is malignant. A score of +infinity corresponds to absolute confidence that the candidate is malignant, and a score of -infinity indicates absolute confidence that the candidate is benign.

2. In the second challenge, our aim is to reduce the workload for radiologists by asking them to read only the subset of cases that the algorithm deems at least somewhat unclear or suspicious. Thus our second challenge is evaluated in terms of the fraction of patients who are labeled as completely normal (not requiring radiologist review of images), subject to the CAD algorithm having 100% sensitivity for malignant patients. (CAD systems which fail to achieve 100% sensitivity will be disqualified from the challenge.) To support this challenge, the participants have to return another file with a binary classification decision about whether each patient in the test set should be reviewed by a radiologist.
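
As a rough illustration of the second criterion (not the official evaluation code; the +1/-1 label conventions below are assumptions), sensitivity and the workload-reduction figure can be computed from per-patient predictions as follows:

  # Challenge 2 sketch: an entry qualifies only if every malignant patient is
  # flagged for review (sensitivity 1.0); the figure of merit is then the
  # fraction of patients ruled out as normal.
  import numpy as np

  def challenge2_metrics(y_true, y_pred):
      """y_true: 1 = malignant patient, -1 = normal; y_pred: 1 = review, -1 = rule out."""
      y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
      sensitivity = float(np.mean(y_pred[y_true == 1] == 1))    # must equal 1.0 to qualify
      ruled_out = float(np.mean(y_pred == -1))                  # fraction of all patients not reviewed
      specificity = float(np.mean(y_pred[y_true == -1] == -1))  # as reported in the result tables
      return sensitivity, ruled_out, specificity

  print(challenge2_metrics([1, -1, -1, -1, -1], [1, -1, -1, 1, -1]))  # (1.0, 0.6, 0.75)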


Background in Breast Cancer

Breast cancer is a disease in which malignant (cancer) cells form in the tissues of the breast. Breast cancer is the second leading cause of cancer deaths in women today (after lung cancer) and is the most common cancer among women, except for skin cancers. About 1.3 million women are expected to be diagnosed with breast cancer worldwide each year, and about 465,000 will die from the disease. In the United States alone, an estimated 240,510 women were expected to be diagnosed with breast cancer in 2007, and 40,460 women were expected to die from the disease.

Screening is looking for cancer in asymptomatic people - i.e., before a person has any symptoms of the disease. Cancer screening can help find cancer at an early stage. When abnormal tissue or cancer is found early, it is often easier to treat. By the time symptoms appear, cancer may have begun to spread. The good news is that breast cancer death rates have been dropping steadily since 1990, both because of earlier detection via screening and better treatments.

The most common breast cancer screening test is a mammogram. A mammogram is an x-ray of the breast. The ability of a mammogram to find breast cancer may depend on the size of the tumor, the density of the breast tissue, and the skill of the radiologist. The mammogram is considered the standard of care for most asymptomatic women. For instance, in the US, insurance companies routinely reimburse for an annual screening mammography examination, for all asymptomatic women over the age of 40. These exams are credited with reducing the breast cancer death rate by approximately 30% since 1990.

However, the reading of screening mammograms is challenging. Findings on a screening mammogram leading to further recall are identified in approximately 5%-10% of patients, even though breast cancer is ultimately confirmed in only three to ten cases in every 1,000 women screened. Perhaps even more importantly, there is compelling evidence that many breast cancers detected at screening mammography are, in retrospect, visible on the previously obtained mammograms but have been missed by the interpreting radiologist in the prior year. There are several reasons for this: The complex radiographic structure of breast tissue, particularly in dense breasts; the subtle nature of many mammographic characteristics of early breast cancer; human oversight; poor quality films and even fatigue or distraction are all reasons why cancer is not detected by mammography.

To overcome the known limitations of human observers, second (i.e., double) reading of screening mammograms by another radiologist has been implemented at many sites. Studies indicate a potential 4%-15% increase in the number of cancers detected with double reading. In a radiology practice that performs 10,000 screening examinations per year, generally between 30 and 100 cancers per year will be detected. Thus, double reading in this practice could contribute to the diagnosis of 1-15 additional cancers per year. However, this approach doubles the radiologist effort, so it is not financially viable.

Rapid and continuing advances in computer technology, as well as the ready adaptation of radiology images to digital formats, have increased the interest in computer prompting to enable the attending radiologist to act as his or her own second reader. One very promising adaptation of computer-prompting technology is computer-aided detection (CAD) in screening mammography. Current CAD systems demonstrate a high rate of detecting cancerous features on mammograms, but further improvements in both sensitivity and specificity would lead to tremendous benefits, both in terms of lives saved each year and in terms of reduction in the workload of radiologists. Over the last 8-10 years, US insurance companies have begun to provide additional reimbursement to mammographers who run CAD algorithms on the mammograms - in other words, physicians are now reimbursed for running a machine learning algorithm to help them better detect cancer.

In an almost universal paradigm, the CAD problem is addressed by a 4 stage system:

  • candidate generation which identifies suspicious unhealthy candidate regions of interest (candidate ROIs, or simply candidates) from a medical image;
  • feature extraction which computes descriptive features for each candidate so that each candidate is represented by a vector x of numerical values or attributes;
  • classification which differentiates candidates that are malignant cancers from the rest of the candidates based on x; and
  • visual presentation of CAD findings to the radiologist.

In this challenge, we focus on stage 3, learning the classifier to differentiate malignant cancers from other candidates.


Hints

The obvious method of classification is to try to build classifiers that simply label each candidate independently. Below we present a few ideas that participants in the challenge may want to consider to potentially improve their algorithms.

Leverage two views of the same breast: Almost always, a cancerous lesion is visible in both views (MLO, CC) of the breast - radiologists routinely try to correlate the two views while diagnosing the patient. In rare cases, however, some lesions may only be visible in one view, especially in certain areas of the breast. Negative candidates, on the other hand, may be present in only one view (e.g., due to image artifacts) or in both views (e.g., if generated by a benign cyst).

Unfortunately, since each view is a 2D image obtained from an orthogonal direction, it is not possible to perfectly register (i.e., correlate the locations across) the X-ray images using simple algorithms, e.g., using affine transformations. However, some of a lesion's features are typically preserved across the two views; particularly, the distance of a lesion from the nipple, and perhaps some of the features themselves relating to size of the lesion, texture, etc. Thus the first idea that may be useful for this challenge is to develop algorithms that simultaneously classify candidates from a pair of images from the same breast. These algorithms could try to exploit correlations in classification decisions for the same region of a breast. To support this, training and testing data sets will include features that identify the (x,y) location of the nipple as well as the (x,y) location of the candidate.
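
One purely illustrative way to act on this hint (not part of the challenge materials; the column names below are assumptions) is to pair candidates across the MLO and CC views of the same breast whenever their nipple-to-candidate distances roughly agree:

  # Cross-view pairing sketch (column names are assumptions, not the released schema).
  import numpy as np
  import pandas as pd

  def nipple_distance(df):
      return np.hypot(df["x"] - df["nipple_x"], df["y"] - df["nipple_y"])

  def pair_views(mlo, cc, tol=0.1):
      """Return (mlo_index, cc_index) pairs whose nipple distances agree within a relative tolerance."""
      d_mlo, d_cc = nipple_distance(mlo).to_numpy(), nipple_distance(cc).to_numpy()
      pairs = []
      for i, d in enumerate(d_mlo):
          j = int(np.argmin(np.abs(d_cc - d)))
          if abs(d_cc[j] - d) <= tol * max(d, 1e-9):
              pairs.append((mlo.index[i], cc.index[j]))
      return pairs

  mlo = pd.DataFrame({"x": [10.0, 40.0], "y": [20.0, 5.0], "nipple_x": 0.0, "nipple_y": 0.0})
  cc = pd.DataFrame({"x": [21.0, 3.0], "y": [9.0, 44.0], "nipple_x": 0.0, "nipple_y": 0.0})
  print(pair_views(mlo, cc))  # [(0, 0), (1, 1)]

Matched pairs could then, for example, have their classifier scores averaged or be passed jointly to a model that classifies the pair.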

Class Imbalance: Participants will be able to leverage ideas from classifier design under extreme class imbalance (the vast majority of the regions are normal, and only a small fraction of the regions are actually malignant), and feature selection (a large number of features are proposed and several of them may not be very useful for the task). The prevalence rate (malignant patients as a fraction of all patients) may differ between the training and testing sets.
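
A minimal sketch of one standard response to this imbalance, namely re-weighting the rare class in an off-the-shelf classifier (illustration only, on synthetic placeholder data, not a prescribed approach):

  # Class-imbalance sketch: weight the rare malignant class so the classifier
  # does not simply predict "benign" everywhere. X and y are synthetic stand-ins
  # for the 117-feature candidate matrix and per-candidate labels.
  import numpy as np
  from sklearn.linear_model import LogisticRegression

  rng = np.random.default_rng(0)
  X = rng.normal(size=(5000, 117))
  y = rng.choice([0, 1], size=5000, p=[0.99, 0.01])  # roughly 1% positives

  clf = LogisticRegression(max_iter=1000, class_weight="balanced")
  clf.fit(X, y)
  scores = clf.decision_function(X)  # per-candidate confidence scores, as required for Challenge 1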

Exploit correlations within an image: Participants may develop novel algorithms for exploiting potential correlations between the diagnoses of suspicious regions within a single image (e.g., if they are spatially adjacent).

Optimize AUC only in a narrow FP range: It may be useful to develop training algorithms that maximize the area under the ROC curve (AUC) in a clinically relevant false positive (FP) range, a problem that has not been adequately addressed in the current machine learning/data mining literature.

Data Download

A breast cancer screen typically consists of 4 X-ray images; 2 images of each breast from different directions (these views are called MLO and CC). Thus, most (but not all) patients would have MLO and CC images of both their breasts, giving a total of 4 images per patient. For the purposes of the KDD Cup, each image is represented by several candidates (see stage 1 above). For each candidate, we provide the image ID and the patient ID, (x,y) location, several features, and a class label indicating whether or not it is malignant. We provide features computed from several standard image processing algorithms - 117 in all - but due to confidentiality reasons we are unable to provide some additional proprietary features. The labels indicate whether a candidate is malignant or benign (based on either a radiologist's interpretation or a biopsy or both). Note that several candidates can correspond to the same lesion. Thus, we also provide a unique lesion-ID for the malignant lesions in the training data. However lesion-ID information will not be included in the test data.

Training Data:

To support this KDD Cup challenge, training information is provided for a set of 118 malignant patients (patients with at least one malignant mass lesion). We also include data from 1594 normal patients - where all candidates are presumed to be benign. The training set consists of a total of 102,294 candidate ROIs, each described by 117 features, but only an extremely small fraction of these 102,294 candidates is actually malignant.

Test Data:

We provide data from over 1000 patients in the same format, except no class label or lesion-ID will be provided.

Supporting Software:

We also provide a software function written in Matlab for plotting Free-Response Receiver Operating Characteristic (FROC) curves. This function plots the sensitivity with which malignant cancers are detected (on the y-axis) versus the average number of false alarms per image (on the x-axis). A malignant cancer is correctly identified if at least one of the candidates corresponding to the lesion is labeled as malignant by the classification algorithm.
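
For orientation, a simplified FROC computation in Python might look like the sketch below; it is not the provided Matlab function, and the column names are assumptions. A lesion counts as detected at a given threshold if any of its candidates scores at or above that threshold, and false positives are above-threshold candidates that belong to no lesion, averaged over images.

  # Simplified FROC sketch (not the official get_ROC_KDD.m; column names assumed).
  # `candidates` is a DataFrame with columns image_id, lesion_id (NaN for candidates
  # that belong to no malignant lesion) and score (the classifier output).
  import numpy as np
  import pandas as pd

  def froc(candidates, thresholds):
      n_images = candidates["image_id"].nunique()
      lesions = candidates.dropna(subset=["lesion_id"])
      n_lesions = lesions["lesion_id"].nunique()
      sens, fp_per_image = [], []
      for t in thresholds:
          detected = lesions.loc[lesions["score"] >= t, "lesion_id"].nunique()
          false_pos = int((candidates["lesion_id"].isna() & (candidates["score"] >= t)).sum())
          sens.append(detected / n_lesions)
          fp_per_image.append(false_pos / n_images)
      return np.array(fp_per_image), np.array(sens)

  def area_in_fp_range(fp, sens, lo=0.2, hi=0.3):
      """Trapezoidal area under the FROC curve restricted to the clinically relevant FP range."""
      mask = (fp >= lo) & (fp <= hi)
      x, y = fp[mask], sens[mask]
      order = np.argsort(x)
      x, y = x[order], y[order]
      return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))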

NOTE: In order to better distinguish between the participants' entries, the training and testing data have been enriched with some difficult cases; further, proprietary features are not included in the dataset. The accuracy of the participants' entries to KDD Cup 2008 should not be considered to be representative of the underlying Siemens CAD software that generated the features.

Results

Winners of KDD Cup 2008: Challenge 1
  • First Place: PMG-IBM-Research
    Team Members: Claudia Perlich, Prem Melville, Grzegorz Swirszcz, Yan Liu, Saharon Rosset, Richard Lawrence.
    Affiliation: IBM Research
  • First Runner Up: Hung-Yi Lo
    Team Members: Hung-Yi Lo, Chun-Min Chang, Tsung-Hsien Chiang, Cho-Yi Hsiao, Anta Huang, Tsung-Ting Kuo, Wei-Chi Lai, Ming-Han Yang, Jung-Jung Yeh, Chun-Chao Yen and Shou-De Lin
    Affiliation: National Taiwan University
  • Second Runner Up: yazhene
    Team Members: Yazhene Krishnaraj and Chandan K. Reddy
    Affiliation: Wayne State University

Winners of KDD Cup 2008: Challenge 2
  • First Place: PMG-IBM-Research
    Team Members: Claudia Perlich, Prem Melville, Grzegorz Swirszcz, Yan Liu, Saharon Rosset, Richard Lawrence.
    Affiliation: IBM Research
  • First Runner Up: TZTeam
    Team Members: Didier Baclin
  • Second Runner Up: Hung-Yi Lo
    Team Members: Hung-Yi Lo, Chun-Min Chang, Tsung-Hsien Chiang, Cho-Yi Hsiao, Anta Huang, Tsung-Ting Kuo, Wei-Chi Lai, Ming-Han Yang, Jung-Jung Yeh, Chun-Chao Yen and Shou-De Lin
    Affiliation: National Taiwan University

Full Results: Challenge 1
 
Rank Name AUC
1 PMG-IBM-Research 0.0933343536925003
2 PMG-IBM-Research 0.0928931984058392
3 PMG-IBM-Research 0.0928931984058392
4 Lo Hung-Yi_2 0.0895233578219535
5 Hung-Yi Lo 0.0895232287765295
6 yazhene 0.0894083831508693
7 cailing 0.0881170171900512
8 Lo Hung-Yi 0.0875758667050807
9 IBM Research, Tokyo Research Laboratory 0.0874117689426676
10 IBM Research, Tokyo Research Laboratory (2) 0.0874117689426676
11 lodoktor 0.086343392874292
12 chuancong 0.0861978416402573
13 chuancong 0.0861978416402573
14 millerh 0.086008775088831
15 gasparpapanek 0.0859007370594447
16 IBM Research, Tokyo Research Laboratory (4) 0.0856741573033704
17 lodoktor 0.0855060981465474
18 smi539 0.0851639777201572
19 kddcup08 0.0850844497263034
20 kddcup08 0.0850844497263034
21 millerh 0.0848278594065107
22 lodoktor 0.0848248583501393
23 lodoktor 0.0848248583501393
24 gaspar 0.0847243229616822
25 gasparpapanek 0.0847243229616822
26 chuancong 0.084644794967829
27 chuancong 0.084644794967829
28 ran310 0.0843942067607797
29 lodoktor 0.0843431888024582
30 IBM Research, Tokyo Research Laboratory (3) 0.0842606597522332
31 kddcup08 0.0842546576394895
32 chuancong 0.0841316143282433
33 chuancong 0.0841316143282433
34 jdlict 0.0840595889753189
35 philb 0.0839785604532799
36 jdlict 0.0839635551714209
37 adayanik 0.0838845319552483
38 wxnorman 0.0837879933736679
39 wxnorman 0.0837879933736679
40 IBM Research, Tokyo Research Laboratory (1) 0.0837159680207432
41 jdlict 0.0836769542879096
42 jdlict 0.083576418899453
43 jimbeckman2 0.0834293671372322
44 jdlict 0.0832733122058964
45 jsnoek 0.0830017166042445
46 MU_CS_BARB 0.0828698976039566
47 adayanik 0.0828696701238834
48 KiD_IITM 0.0827511283971957
49 ran310 0.0826716004033418
50 kddcup08 0.0826130798040909
51 adayanik 0.0824720301546143
52 mehtakrishna 0.0824335956256604
53 hoonla 0.0820353764525114
54 goldenhelix 0.0818583141265725
55 KiD_IITM 0.081832805147412
56 OPA 0.0817900274896764
57 jxie 0.0814921852492077
58 ran310 0.0814471694036301
59 Team_FBTMB 0.0812753079083837
60 H.Hamaguchi 0.0812475991549026
61 bottyan 0.0812400965139728
62 FEG TEAM A 0.0809174829540001
63 family 0.0809114808412563
64 supertmac 0.0809114808412563
65 gtakacs 0.0807239148180163
66 utumno 0.0804868313646404
67 Janet Huang 0.0804808292518974
68 junling 0.0803802938634401
69 millerh 0.0800846898108134
70 appaliunas 0.0799691491404981
71 appaliunas 0.0799691491404981
72 bud_acad 0.079912193292039
73 vbrozh01 0.0795985186785754
74 vbrozh01 0.0795985186785754
75 sachinsg 0.0795904236291169
76 adayanik 0.0794619706136558
77 VladN 0.0794005401901469
78 vadis 0.0792248871602804
79 uran_chu 0.0791388522760009
80 nsitu 0.0789457889176992
81 nsitu 0.0789457889176992
82 millerh 0.0787477191971574
83 gtakacs 0.0786921996542781
84 gtakacs 0.0786816959569766
85 Team_FBTMB 0.0786454221886106
86 LIS 0.0785678064678767
87 chuancong 0.078561653702103
88 chuancong 0.078561653702103
89 petergehler 0.0785577355229038
90 Team_FBTMB 0.0784776643378469
91 jlries61 0.0783170676077979
92 tengke.xiong 0.0783042080812447
93 Team_FBTMB 0.0782184564966873
94 fegyaoy 0.0781685153173913
95 Kevin Pratt 0.078028966196101
96 homesugi 0.0780229640833573
97 Team_FBTMB 0.0779768786612885
98 neo-metrics 0.0777903822145392
99 Athens University of Economics and Business 0.0777138552770575
100 I2R_DM 0.0775848098530682
101 I2R_DM 0.0775848098530682
102 weifanibm2 0.0773822385479689
103 fegyaoy 0.0770551234034379
104 stefansavev 0.0770293611351195
105 erezsil 0.0769820032651491
106 FEG TEAM B2 0.0769500864304232
107 Erinija 0.0769293731393451
108 sdteddy 0.0766099713099013
109 petergehler 0.0763026283251702
110 petergehler 0.0762690254969748
111 bud_acad 0.0762448381830402
112 Mohammad Hasan 0.0760107557860365
113 bud_acad 0.0758135425669836
114 FEG Kohmoto 0.075640125324114
115 Tus 0.0755801041966772
116 bud_acad 0.0754990756746374
117 petergehler 0.0751312457985212
118 toodlewoodle 0.0750639225007203
119 kkang 0.0749828939786806
120 kkang 0.0749828939786806
121 Vadis 0.0749168707384999
122 vadis 0.0749168707384999
123 FEG Kohmoto 0.0744261980217039
124 LouisDuclosGoss 0.0741681071737256
125 LIS 0.0741621050609819
126 Vadis 0.0741380966100069
127 vadis 0.0741380966100069
128 petergehler 0.0740816485402862
129 Vadis 0.0738710025929127
130 vadis 0.0738710025929127
131 Jovice King 0.0738439930855664
132 toodlewoodle 0.0737509603380391
133 fegyaoy 0.0737164481897628
134 H.Hamaguchi 0.0733608230096997
135 Janet Huang 0.0732947997695192
136 FEG TEAM A 0.0731462474791128
137 neo-metrics 0.0731222390281381
138 Vadis 0.0731132358590227
139 vadis 0.0731132358590227
140 bud_acad 0.0730591358158073
141 fvandenb 0.0730247046960531
142 swenig 0.0730247046960531
143 JianfeiWu 0.0729714983674255
144 H.Hamaguchi 0.0729196677230388
145 Janet Huang 0.0728416402573706
146 I2R_DM 0.0724995198309807
147 I2R_DM 0.0724995198309807
148 H.Hamaguchi 0.0723959833861521
149 neo-metrics 0.0723404638432731
150 FEG Kohmoto 0.0722549337366755
151 Tus 0.0721994141937965
152 FEG TEAM A 0.0720493613752043
153 lkmpoon 0.0719848386632099
154 lkmpoon 0.0719848386632099
155 lkmpoon 0.0719848386632099
156 bud_acad 0.0719038101411701
157 TZTeam 0.0714536516853934
158 JianfeiWu 0.070874270743302
159 TZTeam 0.070812926150005
160 maxwang 0.0703837750888314
161 lhan 0.0703102492077212
162 lhan 0.0703102492077212
163 toodlewoodle 0.0700131446269087
164 FEG TEAM A 0.0699456208585424
165 Persistent 0.0698469887400365
166 vlad_g 0.0689692223662731
167 H.Hamaguchi 0.0689297632766734
168 Janet Huang 0.0688622395083072
169 Tus 0.0688457336982618
170 Vijayalekshmi K 0.068640078507635
171 tantz 0.0678824876356479
172 Tus 0.0675829089839624
173 mehmoodkhan 0.0668485306828002
174 chuancong 0.0668155190627101
175 chuancong 0.0668155190627101
176 chuancong 0.0668155190627101
177 hepei 0.0665536888984923
178 hepei 0.0663863680015365
179 homesugi 0.0663743637760491
180 ANA 0.0663623595505617
181 ANA 0.0663623595505617
182 SzucsGabor 0.0658059282867571
183 strombergd 0.0652399644674926
184 ran310 0.06494285988668
185 sachinsg 0.0644735762988571
186 Tus 0.0633812734082396
187 philb 0.063280274656679
188 ran310 0.062286924997599
189 Janet Huang 0.0621773864400269
190 sachinsg 0.0615418443292039
191 sachinsg 0.0615168329251897
192 abdellali 0.0606366987179488
193 uds_prospectus 0.0604599911168733
194 agmajmtraina 0.0594359214443487
195 agmajmtraina 0.0594359214443487
196 rouzbeh.stat 0.0548142628205129
197 majid tavafoghi 0.0548087294727745
198 ANA 0.0518662477191973
199 ANA 0.0517181840007684
200 LIS 0.0484484214443484
201 H.Hamaguchi 0.0445446797272642
202 FEG TEAM A 0.0426360078747721
203 luckyfreak 0.0424750540190147
204 luckyfreak 0.0420672968885047
205 ikmlab 0.0393517082012869
206 Yanglai 0.0372491116873139
207 DangMai 0.032565963219053
208 MaiDang 0.032565963219053
209 yschen 0.025923124939979
210 jdlict 0.023000096033804
211 bottyan 0.0226631554307116
212 bottyan 0.0224959185633343
213 FEG_DEVELOP_DIV 0.0195518822625564
214 GMSandCo12081971 0.016024140497455
215 NeuroTech_CIn_UFPE 0.013137167482954
216 socrates 0.0127649932776337
217 fegyaoy 0.0122488115816768
218 NatsumiAbe 0.0103251344473255
219 smi539 0.0103180069384423
220 tzeng@bsu.edu 0.00985729676846253
221 premswaroop 0.00931677950638629
222 premswaroop 0.00930027369634109
223 smi539 0.00873757562662061
224 premswaroop 0.00872304091040045
225 premswaroop 0.00827125708249311
226 ljchien 0.00815212774896766
227 smi539 0.0076872058964756
228 smi539 0.00750714251416502
229 somerandomteamsacg 0.00749710938250269
230 Vijayalekshmi K 0.0074970763708826
231 sachin 0.00749704455968506
232 sachinsg 0.00749701394891007
233 somerandomteamsacg 0.00749692931912038
234 socrates 0.00749690351003558
235 premswaroop 0.00656988920099879
236 bottyan 0.00628271151445311
237 fegyaoy 0.00598410640545472
238 ljchien 0.0052173076923077
239 ljchien 0.00473879465571882
240 ljchien 0.00356375444156345
241 ljchien 0.00270095073465861
242 ljchien 0.00270095073465861
243 sreedal 0
244 teamiisc 0
245 teammllab 0
End of Table

Full Results: Challenge 2

Rank Name Sensitivity Specificity
1 PokerFace 1 0.681575433911882
2 IBMClaudia 1 0.646862483311081
3 yanliu 1 0.624833110814419
4 TZTeam 1 0.174232309746328
5 Lo Hung-Yi 1 0.168891855807744
6 teamiisc 1 0.0974632843791722
7 I2R_DM 1 0.0947930574098798
8 Kevin Pratt 1 0.0907877169559413
9 bud_acad 1 0.0747663551401869
10 philb 1 0.0707610146862483
11 philb 1 0.068758344459279
12 sreedal 1 0.0607476635514019
13 philb 1 0.0380507343124166
14 toodlewoodle 1 0.0287049399198932
15 bud_acad 1 0.0233644859813084
16 PHILB 1 0.0186915887850467
17 I2R_DM 1 0.0133511348464619
18 fegyaoy 1 0.00534045393858478
19 NeuroTech_CIn_UFPE 1 0.00467289719626168
20 Gabor 1 0.000667556742323097
21 kkang 1 0.000667556742323097
22 Janet Huang 1 0
23 strombergd 1 0
24 FEG Kohmoto 0.990384615384615 0.347129506008011
25 kddcup08 0.990384615384615 0.323765020026702
26 FEG Kohmoto 0.990384615384615 0.298397863818425
27 fegyaoy 0.990384615384615 0.297062750333778
28 jdlict 0.990384615384615 0.269025367156208
29 fegyaoy 0.990384615384615 0.249666221628838
30 petergehler 0.990384615384615 0.227636849132176
31 millerh 0.990384615384615 0.223631508678238
32 appaliunas 0.990384615384615 0.20694259012016
33 appaliunas 0.990384615384615 0.20694259012016
34 Janet Huang 0.990384615384615 0.180240320427236
35 mehtakrishna 0.990384615384615 0.17890520694259
36 Team_FBTMB 0.990384615384615 0.174232309746328
37 Janet Huang 0.990384615384615 0.162216288384513
38 supertmac 0.990384615384615 0.15086782376502
39 Janet Huang 0.990384615384615 0.130841121495327
40 bud_acad 0.990384615384615 0.126835781041389
41 Jovice King 0.990384615384615 0.104806408544726
42 bud_acad 0.990384615384615 0.0987983978638184
43 toodlewoodle 0.990384615384615 0.0874499332443258
44 vlad_g 0.990384615384615 0.0594125500667557
45 toodlewoodle 0.990384615384615 0.0574098798397864
46 Janet Huang 0.990384615384615 0.0554072096128171
47 gaspar 0.990384615384615 0.0500667556742323
48 SzucsGabor 0.990384615384615 0.0427236315086782
49 tengke.xiong 0.980769230769231 0.425901201602136
50 millerh 0.980769230769231 0.32977303070761
51 hkashima 0.980769230769231 0.326435246995995
52 egyedul 0.980769230769231 0.316421895861148
53 jdlict 0.980769230769231 0.281708945260347
54 lodoktor 0.980769230769231 0.281708945260347
55 hkashima 0.980769230769231 0.279706275033378
56 jimbeckman2 0.980769230769231 0.258344459279039
57 jdlict 0.980769230769231 0.241655540720961
58 Team_FBTMB 0.980769230769231 0.239652870493992
59 toodlewoodle 0.980769230769231 0.2369826435247
60 teammllab 0.980769230769231 0.22630173564753
61 hkashima 0.980769230769231 0.220293724966622
62 Team_FBTMB 0.980769230769231 0.21762349799733
63 FEG TEAM B2 0.980769230769231 0.204272363150868
64 jxie 0.980769230769231 0.171562082777036
65 VladN 0.980769230769231 0.166889185580774
66 fegyaoy 0.980769230769231 0.107476635514019
67 KiD_IITM 0.980769230769231 0.0987983978638184
68 Vijayalekshmi K 0.980769230769231 0.0974632843791722
69 cyhsiao 0.971153846153846 0.57543391188251
70 LIS 0.971153846153846 0.380507343124166
71 petergehler 0.971153846153846 0.341789052069426
72 Team_FBTMB 0.971153846153846 0.305073431241656
73 gtakacs 0.971153846153846 0.297062750333778
74 neo-metrics 0.971153846153846 0.294392523364486
75 Erinija 0.971153846153846 0.287716955941255
76 neo-metrics 0.971153846153846 0.277703604806409
77 sdteddy 0.971153846153846 0.277036048064085
78 jdlict 0.971153846153846 0.268357810413885
79 Mohammad Hasan 0.971153846153846 0.2543391188251
80 vbrozh01 0.971153846153846 0.22630173564753
81 weifanibm2 0.961538461538462 0.493324432576769
82 petergehler 0.961538461538462 0.454606141522029
83 hkashima 0.961538461538462 0.425901201602136
84 Vadis 0.961538461538462 0.421228304405874
85 millerh 0.961538461538462 0.392523364485981
86 Vadis 0.961538461538462 0.345126835781041
87 hkashima 0.961538461538462 0.325100133511348
88 Vadis 0.961538461538462 0.297062750333778
89 hkashima 0.961538461538462 0.278371161548732
90 BEBOPE 0.961538461538462 0.229639519359146
91 chuancong 0.961538461538462 0.229639519359146
92 liu-r07 0.961538461538462 0.229639519359146
93 chuancong 0.961538461538462 0.22630173564753
94 hkashima 0.961538461538462 0.222296395193591
95 bud_acad 0.961538461538462 0.188251001335113
96 millerh 0.951923076923077 0.487983978638184
97 Team_FBTMB 0.951923076923077 0.343791722296395
98 chuancong 0.942307692307692 0.676902536715621
99 Liu Rui 0.942307692307692 0.676902536715621
100 Rui Liu 0.942307692307692 0.676902536715621
101 LIS 0.942307692307692 0.536048064085447
102 liu-r07 0.942307692307692 0.457943925233645
103 fvandenb 0.942307692307692 0.445260347129506
104 fvandenb 0.942307692307692 0.445260347129506
105 hkashima 0.942307692307692 0.42456608811749
106 FEG Kohmoto 0.942307692307692 0.421895861148198
107 fvandenb 0.942307692307692 0.383845126835781
108 swenig 0.942307692307692 0.383845126835781
109 fvandenb 0.942307692307692 0.347797062750334
110 fvandenb 0.942307692307692 0.295060080106809
111 FEG Kohmoto 0.932692307692308 0.559412550066756
112 H.Hamaguchi 0.932692307692308 0.55607476635514
113 fegyaoy 0.932692307692308 0.491989319092123
114 hkashima 0.932692307692308 0.255006675567423
115 FEG_DEVELOP_DIV 0.932692307692308 0.102803738317757
116 yazhene 0.923076923076923 0.875834445927904
117 weifanibm2 0.923076923076923 0.647530040053405
118 NaokoMatsushita 0.923076923076923 0.542056074766355
119 fvandenb 0.923076923076923 0.51068090787717
120 Vadis 0.923076923076923 0.458611481975968
121 FEG TEAM B2 0.923076923076923 0.449265687583445
122 weifanibm2 0.913461538461538 0.652870493991989
123 petergehler 0.913461538461538 0.563417890520694
124 weifanibm2 0.903846153846154 0.666889185580774
125 DePaul-Ekarin 0.903846153846154 0.665554072096128
126 weifanibm2 0.903846153846154 0.662883845126836
127 FEG Kohmoto 0.903846153846154 0.562082777036048
128 Vadis 0.903846153846154 0.541388518024032
129 NaokoMatsushita 0.894230769230769 0.596128170894526
130 premswaroop 0.884615384615385 0.781041388518024
131 millerh 0.884615384615385 0.644192256341789
132 FEG TEAM A 0.884615384615385 0.616154873164219
133 utumno 0.875 0.715620827770361
134 chuancong 0.875 0.705607476635514
135 FEG_DEVELOP_DIV 0.875 0.205607476635514
136 agmajmtraina 0.865384615384615 0.666221628838451
137 agmajmtraina 0.865384615384615 0.666221628838451
138 Tus 0.865384615384615 0.525367156208278
139 petergehler 0.855769230769231 0.677570093457944
140 luckyfreak 0.855769230769231 0.457276368491322
141 luckyfreak 0.855769230769231 0.457276368491322
142 luckyfreak 0.855769230769231 0.457276368491322
143 sachinsg 0.846153846153846 0.716955941255007
144 sachinsg 0.846153846153846 0.624165554072096
145 mehmoodkhan 0.846153846153846 0.58611481975968
146 FEG TEAM B2 0.836538461538462 0.695594125500668
147 LIS 0.836538461538462 0.673564753004005
148 sachinsg 0.826923076923077 0.746995994659546
149 FEG TEAM B2 0.826923076923077 0.71762349799733
150 LIS 0.826923076923077 0.679572763684913
151 FEG TEAM A 0.826923076923077 0.618825100133511
152 utumno 0.817307692307692 0.722963951935915
153 NaokoMatsushita 0.817307692307692 0.682242990654206
154 abdellali 0.788461538461538 0.699599465954606
155 uds_prospectus 0.788461538461538 0.698931909212283
156 premswaroop 0.778846153846154 0.880507343124166
157 ran310 0.778846153846154 0.853805073431242
158 lhan 0.778846153846154 0.817757009345794
159 lhan 0.778846153846154 0.817757009345794
160 H.Hamaguchi 0.778846153846154 0.800400534045394
161 ran310 0.759615384615385 0.879172229639519
162 FEG TEAM A 0.759615384615385 0.787049399198932
163 FEG TEAM A 0.740384615384615 0.727636849132176
164 FEG_DEVELOP_DIV 0.740384615384615 0.356475300400534
165 H.Hamaguchi 0.730769230769231 0.716288384512684
166 sachinsg 0.721153846153846 0.823097463284379
167 FEG TEAM A 0.721153846153846 0.744993324432577
168 premswaroop 0.711538461538462 0.895861148197597
169 H.Hamaguchi 0.711538461538462 0.844459279038718
170 Tus 0.701923076923077 0.706275033377837
171 ran310 0.692307692307692 0.901201602136182
172 sachinsg 0.692307692307692 0.817757009345794
173 rouzbeh.stat 0.692307692307692 0.722963951935915
174 premswaroop 0.682692307692308 0.933911882510013
175 NaokoMatsushita 0.682692307692308 0.788384512683578
176 majid tavafoghi 0.682692307692308 0.724299065420561
177 H.Hamaguchi 0.663461538461538 0.841121495327103
178 FEG TEAM B2 0.653846153846154 0.89652870493992
179 premswaroop 0.644230769230769 0.944592790387183
180 smi539 0.634615384615385 0.937249666221629
181 smi539 0.634615384615385 0.930574098798398
182 ran310 0.634615384615385 0.923898531375167
183 Tus 0.625 0.7543391188251
184 smi539 0.596153846153846 0.941255006675567
185 Tus 0.596153846153846 0.805740987983979
186 smi539 0.567307692307692 0.945927903871829
187 smi539 0.548076923076923 0.951268357810414
188 Tus 0.538461538461538 0.855140186915888
189 maxwang 0.519230769230769 0.96128170894526
190 smi539 0.490384615384615 0.953938584779706
191 tzeng@bsu.edu 0.432692307692308 0.691588785046729
192 NaokoMatsushita 0.413461538461538 0.889185580774366
193 ljchien 0.307692307692308 0.587449933244326
194 tantz 0.211538461538462 0.902536715620828
195 Yanglai 0.192307692307692 0.997329773030708
196 ljchien 0.192307692307692 0.775033377837116
197 GMSandCo12081971 0.173076923076923 0.757676902536716
198 synAnnypoliFerajna 0.153846153846154 0.997329773030708
199 ljchien 0.134615384615385 0.836448598130841
200 louis.gosselin 0.0576923076923077 0.991989319092123
201 DangMai 0.00961538461538462 0.988651535380507
202 MaiDang 0.00961538461538462 0.988651535380507
End of Table

FAQs

Announcements
  • The evaluation function get_ROC_KDD.m has been changed slightly on Apr 24, 2008 to fix a minor bug. Please download and use the new version (requires login). We would like to thank Will Dwinnell for identifying the bug.
  • Another bug was found in get_ROC_KDD.m. The file has been changed on May 27, 2008 to fix the bug. Please download and use the new version. We would like to thank Peter Gehler and Sebastian Nowozin for identifying the bug.
  • get_ROC_KDD.m was updated on June 19, 2008 to fix a bug. Please download and use the new version. We would like to thank Peter Gehler and Sebastian Nowozin for identifying the bug.
  • get_ROC_KDD.m was updated on June 23, 2008 to fix a minor bug. Please download and use the new version. We would like to thank Sinno Jialin Pan for identifying the bug.

Frequently Asked Questions

How do I format the results files?

  • Challenge 1
    The results file should be a list of scores (one per line), one for each candidate, in exactly the same order as the candidates in the test file. The scores can lie anywhere in the range (-infinity, +infinity) or in [0,1]. As long as a low score indicates a low likelihood of malignancy, a high score indicates a high likelihood of malignancy, and the get_ROC_KDD.m file works with your scores, our evaluation script will work. You can refer to pred_scores.txt to see an example of the format.
  • Challenge 2
    The results file should be a list of patient_IDs and predicted labels (-1 for a benign and 1 for a malignant patient), one per line, with the patient_ID and prediction separated by a comma. Please see pred_per_patient_labels.txt for a sample file.

Note that the example files are provided only to illustrate the format; the scores in them were generated completely at random.

How do I submit the results files?

The submission page is coming up early the week of June 16.

Contacts

KDD Cup 2008 Chair
  • Balaji Krishnapuram
    Siemens Medical Solutions, USA

Please visit the original KDD Cup 2008 website for more information.

KDD Cup 2007: Consumer recommendations

This Year's Challenge

This year's KDD Cup focuses on predicting aspects of movie rating behavior. There are two tasks. The tasks, developed in conjunction with Netflix, have been selected to be interesting to participants from both academia and industry. You can choose to compete in either or both of the tasks.

Task Description

This year's tasks employ the Netflix Prize training data set. This data set consists of more than 100 million ratings from over 480 thousand randomly chosen, anonymous customers on nearly 18 thousand movie titles. The data were collected between October 1998 and December 2005 and reflect the distribution of all ratings received by Netflix during this period. The ratings are integers on a scale from 1 to 5 stars.

This year's competition consists of two tasks. Each team can participate in either one or both of the tasks.

Task 1 (Who Rated What in 2006): Your task is to predict which users rated which movies in 2006. We will provide a list of 100,000 (user_id, movie_id) pairs where the users and movies are drawn from the Netflix Prize training data set. None of the pairs were rated in the training set. Your task is to predict the probability that each pair was rated in 2006 (i.e., the probability that user_id rated movie_id in 2006). (The actual rating is irrelevant; we just want whether the movie was rated by that user sometime in 2006. The date in 2006 when the rating was given is also irrelevant.)

Task 2 (How Many Ratings in 2006): Your task is to predict the number of additional ratings the users from the Netflix Prize training dataset gave to a subset of the movies in the training dataset. We provide a list of 8863 movie_ids drawn from the Netflix Prize training dataset. You need to predict the number of additional ratings that all users in the Netflix Prize training dataset provided in 2006 for each of those movie titles. (Again the actual rating given by each user is irrelevant; we just want the number of times that the movie was rated in 2006. The date in 2006 when the rating was given is also irrelevant.)


Evaluation

Winners will be determined, for both tasks, by computing the root mean squared error (RMSE) between your individual predictions and the correct answers. That is, if your prediction for an item is Y, the correct answer for the item is X, and there are n items, then RMSE = sqrt((sum over all items of (X-Y)^2)/n). The entry with the smallest RMSE will be judged the winner; in case of a tie, the entry with the earliest submission date will be judged the winner.

  • In the case of "Who Rated What in 2006", the correct answer is 1 if the movie is rated by that user, 0 otherwise.
  • In the case of "How Many Ratings in 2006", the correct answer is the actual number of ratings received. However, RMSE is computed slightly differently from the first task. Assuming that the actual number of ratings received is X, in computing RMSE, we use ln(1+X), where "ln" is natural logarithm. This also applies to your predicted number. That is, assuming that your predicted number is Y, we use ln(1+Y).

Note: We reserve the right to use a different evaluation criterion if no team can achieve the baseline result for a task. For example, for task one, the baseline result is one that assigns each pair the probability of the base rate, which is the proportion of movies rated in the test set (unknown to contestants).

Following the award of the KDD Cup prizes, the answer sets will be made available at the KDD Cup website and the Netflix Prize website.

Competition Rules

By registering, you indicate your full and unconditional agreement and acceptance of these contest rules.

Eligibility: The contest is open to any party planning to attend KDD 2007. A person can participate in only one team. Each team can participate in either or both of the tasks.

Integrity: The contestant takes the responsibility of obtaining any permission to use any algorithms, tools, or additional data that are intellectual property of a third party. Permission is granted by Netflix to employ the Netflix Prize data set for the KDD Cup competition.

Registration: You must register before June 1, 2007 in order to participate. By registering, we mean two things: (1) sign up as a user of the site, and (2) create a submission (which may be a dummy submission) for each task in which you will participate (by clicking on the "Submit Paper" button). All the submitted information can be updated later, but no new submissions can be made after June 1, 2007. Registration does not imply any commitment to participation. After registration, the system will give you a unique identifier (called a Paper ID) for the submitted KDD Cup task. The ID will be used when submitting the corresponding results file. Separate registration is required on the Netflix Prize web site to download the training data set. We will keep your registration data private.

Data Download

Training Data Set

The Netflix Prize training dataset is available for download from here. You must register separately at that site to download the training dataset, even if you elect not to enter the Netflix Prize contest itself. The format of the training data is described on the Netflix Prize website and in the training dataset file. No additional training data will be provided.

Qualifying Answer Sets

The qualifying answer sets can be downloaded from the links below.

The user_ids and movie_ids are taken from the Netflix Prize training dataset.

Results

Winners of KDD Cup 2007: Task 1 - Who Rated What
  • First Place: Miklos Kurucz, Andras A. Benczur, Tamas Kiss, Istvan Nagy, Adrienn Szabo, Balazs Torma (Hungarian Academy of Sciences)
  • First Runner Up: Advanced Analytical Solutions Team of Neo Metrics directed by Jorge Sueiras (Neo Metrics)
  • Second Runner Up: Yan Liu, Zhenzhen Kou (IBM Research)

Winners of KDD Cup 2007: Task 2 - How Many Ratings
  • First Place: Saharon Rosset, Claudia Perlich, Yan Liu (IBM Research)
  • First Runner Up: Advanced Analytical Solutions Team of Neo Metrics directed by Jorge Sueiras (Neo Metrics)
  • Second Runner Up: James Malaugh (Team Lead), Sachin Gangaputra, Nikhil Rastogi, Rahul Shankar, Sandeep Gupta, Kushagra Gupta, Neha Gupta, Gaurav Lal (Inductis)

Final Results: Task 1 - Who Rated What

Results are listed using the Submission ID for privacy.

Rank Submission ID RMSE
1 117 0.256
2 43 0.263
3 184 0.265
4 187 0.267
5 86 0.268
6 100 0.269
7 168 0.271
8 148 0.272
9 195 0.274
10 59 0.279
11 175 0.279
12 138 0.279
13 134 0.282
14 185 0.282
15 190 0.283
16 19 0.288
17 180 0.304
18 110 0.304
19 49 0.310
20 51 0.316
21 38 0.319
22 40 0.321
23 21 0.325
24 54 0.336
25 123 0.377
26 29 0.389
27 197 0.452
28 196 0.456
29 24 0.506
30 193 0.535
31 162 0.580
32 10 0.621
33 83 0.623
34 137 0.642
35 57 0.661
36 78 0.685
37 179 0.773
38 192 0.812
39 17 0.882

End of Table
 
Final Results: Task 2 - How Many Ratings

Results are listed using the Submission ID for privacy.

Rank Submission ID RMSE
1 194 774.253
2 79 1028.381
3 115 1325.156
4 188 1826.473
5 16 1906.046
6 109 2174.804
7 58 2451.966
8 163 2505.991
9 7 2631.011
10 174 3303.919
11 50 3567.428
12 33 3998.508
13 52 4422.552
14 118 4427.674
15 183 4746.861
16 53 5346.540
17 44 5358.311
18 191 5617.214
19 135 5847.389
20 182 7588.236
21 42 7820.966
22 189 8147.300
23 41 8280.541
24 140 9018.958
25 146 9735.349
26 23 9735.349
27 101 10176.672
28 87 10438.264
29 136 11973.700
30 124 27142.447
31 85 32061.956
32 149 33974.962
33 181 74126.234
34 18 213572.869
 

 

End of Table

Frequently Asked Questions

How were the qualifying answer sets formed?

The number of ratings (from 1 to 5) given in 2006 per movie and per user was pulled from the Netflix ratings database, restricted to ratings given by people in the Prize dataset to movies in the Prize dataset. The set of movies was split randomly into two sets, one per task, resulting in 6822 movies for the "Who Rated What in 2006" task and 8863 movies for the "How Many Ratings in 2006" task. For the "Who Rated What in 2006" task, a set of 100,000 (user_id, movie_id) pairs was generated by picking movie and user ids at random, restricted to the 6822 movie_ids in that task's set but drawing from all the users in the Netflix Prize dataset. The probability of picking any given movie was proportional to the number of ratings that movie received in 2006; the probability of picking any given user was proportional to the number of ratings that user gave in 2006. Pairs that corresponded to ratings in the existing Netflix Prize dataset were discarded. Each selected (user_id, movie_id) pair was then looked up in the Netflix ratings database to see if the user rated that movie at any time during 2006.
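The sampling procedure described above can be sketched roughly as follows. This is a hedged illustration only: movie_counts and user_counts are assumed to be dicts mapping ids to 2006 rating counts, and existing_pairs the set of (user_id, movie_id) pairs already present in the Prize dataset.

import random

def sample_candidate_pairs(movie_counts, user_counts, existing_pairs, n_pairs=100000):
    movies, movie_weights = zip(*movie_counts.items())
    users, user_weights = zip(*user_counts.items())
    pairs = set()
    while len(pairs) < n_pairs:
        # Pick a movie and a user with probability proportional to their 2006 rating counts.
        m = random.choices(movies, weights=movie_weights, k=1)[0]
        u = random.choices(users, weights=user_weights, k=1)[0]
        if (u, m) not in existing_pairs:   # discard pairs already rated in the Prize data
            pairs.add((u, m))
    return pairs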

Are we allowed to use external sources of information about the movies?

Yes. There is no restriction on what data you can/can't use to make your predictions. Please respect the terms of use for any source of information you employ.

Do I have to submit an algorithm description?

Only some top-ranked teams will be invited to submit workshop papers describing their algorithms.

Contacts

KDD Cup 2007 Chair
  • Bing Liu
    University of Illinois at Chicago

Please visit the original KDD Cup 2007 website for more information.

KDD Cup 2006: Pulmonary embolisms detection from image data

This Year's Challenge

This year's KDD Cup challenge problem is drawn from the domain of medical data mining. The tasks are a series of Computer-Aided Detection problems revolving around the clinical problem of identifying pulmonary embolisms from three-dimensional computed tomography data. This challenging domain is characterized by:

  • Multiple instance learning
  • Non-IID examples
  • Nonlinear cost functions
  • Skewed class distributions
  • Noisy class labels
  • Sparse data

Download the Gzipped tarball for scoring scripts, test data, raw results, Final PDF rules file, bootstrap vectors, etc.

Task Description

Challenge of Pulmonary Emboli Detection

Pulmonary embolism (PE) is a condition that occurs when an artery in the lung becomes blocked. In most cases, the blockage is caused by one or more blood clots that travel to the lungs from another part of the body. While PE is not always fatal, it is nevertheless the third most common cause of death in the US, with at least 650,000 cases occurring annually [1]. The clinical challenge, particularly in an Emergency Room scenario, is to correctly diagnose patients that have a PE and then send them on to therapy. This, however, is not easy, as the primary symptom of PE is dyspnea (shortness of breath), which has a variety of causes, some of which are relatively benign, making it hard to separate out the critically ill patients suffering from PE.

The two crucial clinical challenges for a physician, therefore, are to diagnose whether a patient is suffering from PE and to identify the location of the PE. Computed Tomography Angiography (CTA) has emerged as an accurate diagnostic tool for PE. However, each CTA study consists of hundreds of images, each representing one slice of the lung. Manual reading of these slices is laborious, time-consuming, and complicated by various PE look-alikes (false positives) including respiratory motion artifacts, flow-related artifacts, streak artifacts, partial volume artifacts, stair-step artifacts, lymph nodes, and vascular bifurcation, among many others. Additionally, when PE is diagnosed, medications are given to prevent further clots, but these medications can sometimes lead to subsequent hemorrhage and bleeding since the patient must stay on them for a number of weeks after the diagnosis. Thus, the physician must review each CAD output carefully for correctness in order to prevent overdiagnosis. Because of this, the CAD system must produce only a small number of false positives per patient scan.

The goal of a CAD system, therefore, is to automatically identify PE's. In an almost universal paradigm for CAD algorithms, this problem is addressed by a 3 stage system:
1. Identification of candidate regions of interest (ROI) from a medical image,
2. Computation of descriptive features for each candidate, and
3. Classification of each candidate (in this case, whether it is a PE or not) based on its features.

In this year's KDD Cup data, Steps 1 and 2 have been done for you. Your goal is to design a series of classifiers related to Step 3.

Classification Tasks

Task 1: The first classification task is to label individual PE's. For clinical acceptability, it is critical to control false positive rates - a system that "cries wolf" too often will be rejected out of hand by clinicians. Thus, the goal is to detect as many true PE's as possible, subject to a constraint on false positives.

For this task, we make the following definitions:

  • PE sensitivity is defined as the number of PE's correctly identified in a patient. A PE is correctly identified if at least one of the candidates associated with that PE is correctly labeled as a positive. Note: identifying 2 or more candidates for the same PE has no additional impact on the sensitivity.
  • False positives are defined as the number of candidates falsely labeled as a PE in a patient - i.e., the total of all negative candidates labeled as PEs in the patient.
  • The average FP rate for a test set is the average number of FPs produced across all patients in that test set.

You may (probably should) use different classifiers for each sub-task below:

  • Task 1a. Build a system where the false positive rate is at most 2 per patient.
  • Task 1b. Build a system where the false positive rate is at most 4 per patient.
  • Task 1c. Build a system where the false positive rate is at most 10 per patient.

In each task, the classifiers will be ranked based on PE sensitivity, as long as the false positive rate meets the specified threshold.
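A minimal sketch of these two quantities is shown below, assuming candidate-level data in the format described under Data Download; the function and variable names are illustrative, not the official scoring code.

from collections import defaultdict

def task1_metrics(candidates, predicted_labels):
    # candidates: list of (patient_id, label) where label 0 marks a negative candidate
    # and label X > 0 means the candidate belongs to PE number X (see the FAQ below).
    # predicted_labels: parallel list of 0/1 predictions. Assumes at least one patient.
    detected = defaultdict(set)   # patient_id -> set of PE indices hit at least once
    false_pos = defaultdict(int)  # patient_id -> negative candidates labeled positive
    patients = set()
    for (patient_id, label), pred in zip(candidates, predicted_labels):
        patients.add(patient_id)
        if pred == 1:
            if label > 0:
                detected[patient_id].add(label)   # extra hits on the same PE do not add
            else:
                false_pos[patient_id] += 1
    mean_sensitivity = sum(len(s) for s in detected.values()) / len(patients)
    avg_fp_rate = sum(false_pos.values()) / len(patients)
    return mean_sensitivity, avg_fp_rate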

Task 2: The second classification task is to label each patient as having a PE or not. The reason this is important is that patient treatment for PE is systemic - i.e., many aspects of the treatment are the same whether the patient has one or many PE's. For this task, we make the following definitions:

Patient sensitivity is defined as the number of patients for whom at least one true PE is correctly identified. As above, a PE is correctly identified if any one of the candidates associated with that PE is correctly labeled, and multiple correct identifications in a single patient do not increase the sensitivity score.

False positives are defined as the number of candidates falsely labeled as a PE in a patient.

The average FP rate for a test set is the average number of FPs produced across all patients in that test set.

Again, for this task, 3 classifiers should be built, and any classifier that yields an average FP rate above the specified FP threshold on any sub-task will be disqualified.

  • Task 2a. Build a system where the false positive rate is at most 2 per patient.
  • Task 2b. Build a system where the false positive rate is at most 4 per patient.
  • Task 2c. Build a system where the false positive rate is at most 10 per patient.

In each task, the classifiers will be ranked based on patient sensitivity as long as the false positive rate obeys the specified FP rate. You may use the same classifier(s) as in Task 1, or build different classifiers for this task.

Task 3: One of the most useful applications for CAD would be a system with very high (100%?) Negative Predictive Value. In other words, if the CAD system had zero positive candidates for a given patient, we would like to be very confident that the patient was indeed free from PE's. In a very real sense, this would be the "Holy Grail" of a PE CAD system.

Unfortunately, as our training set contains relatively few negative PE cases (20 in all), building such a classifier may be a very hard task. However, we anticipate that we will have a larger number of negative cases by the time the test data set is released, allowing a better measure of the performance of the system on this task.

For this task, we make the following definitions:

A patient is identified as negative when the CAD system produces no positive labels for any of that patient's candidates.

The negative predictive value (NPV) for a classifier is TN/(TN+FN) (i.e., the number of true negatives divided by the total of true and false negatives).

Note that the NPV is maximized by a classifier that correctly identifies some negative patients but produces no false negatives (no positive patients identified as negative).

To qualify for this task, a classifier must have 100% NPV (i.e., when it says a patient has no positive marks, the patient must have no true PE's). The primary criterion will be the highest number of negative patients identified in the test set (largest TN), subject to a minimum cut-off of identifying 40% of the negative patients on the test set. The first tie breaker will be the sensitivity on PE's (as defined in Task 1), followed by the false positive rate on the entire test set.
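A hedged sketch of the NPV computation, assuming per-patient ground truth and per-patient CAD output are available as simple dicts (the names are illustrative):

def task3_npv(patient_has_pe, patient_flagged):
    # patient_has_pe: patient_id -> True if the patient truly has at least one PE.
    # patient_flagged: patient_id -> True if the CAD system produced any positive candidate label.
    tn = sum(1 for pid, flagged in patient_flagged.items()
             if not flagged and not patient_has_pe[pid])
    fn = sum(1 for pid, flagged in patient_flagged.items()
             if not flagged and patient_has_pe[pid])
    return tn / (tn + fn) if (tn + fn) > 0 else float("nan")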

More task descriptions are provided in this PDF file.


Evaluation

Each submission will be evaluated according to the criteria set forth under each task, above. The winner for each task will be the group with the best score according to the specified metric for that task. In the event of a tie, multiple winners may be awarded or, at the chair's option, a tie-breaking metric may be employed. Results of the competition will be announced individually to participants in advance of KDD; public announcement of results will be during the opening ceremony of KDD.

Competition Rules

Eligibility: The contest is open to any party planning to attend KDD 2006. Each of the three tasks will be evaluated separately; you can enter as many tasks (or as few tasks) as you like. A person can participate in only one group per task.

Registration: Each participating group must register with the competition in order to gain access to the training data. The registration must indicate a single "group lead" who will be the point of contact for the group. Each registered group lead will be subscribed to the KDD Cup 2006 mailing list. The mailing list will be used for contact with the participating groups and to announce rule clarifications, availability of additional data, etc. Groups are also encouraged to use this list to post questions and hold discussions.

Participation in tasks: This year's KDD Cup consists of three different tasks. A group may choose to submit to any or all of these tasks. Performance in one task will not positively or negatively impact the evaluation of performance in a different task. If a group chooses to participate in Tasks 1 or 2, they must submit results for all of the sub-tasks. The decision to participate in a given task will be made during results submission; a group does not have to commit to any particular tasks when registering to receive the training data.

Data Download

For this task, a total of 69 cases were collected and provided to an expert chest radiologist, who reviewed each case and marked the PEs. The cases were randomly divided into training and test sets. The training set includes 38 positive cases and 8 negative cases, while the test set contains the remaining 23 cases. The test group is sequestered and will only be used to evaluate the performance of the final system.

Note: It turns out that patient numbers 3111 and 3126 in the test data duplicate patients 3103 and 3115, respectively, in the training data. Therefore, 3111 and 3126 were dropped from the final evaluation for the competition. These patients should be discarded from testing for any future comparisons to the KDD Cup results. The file drop_duplicate_patients_mask.dat, in the KDD Cup archive file, can be used to identify the excluded rows in the KDDPETest.txt file.

More data descriptions are provided in this PDF file. My Note: These 7 files failed to download.


Data Format

The training data will consist of a single data file plus a file containing field (feature) names. Each line of the file represents a single candidate and comprises a number of whitespace-separated fields:

  • Field 0: Patient ID (unique integer per patient)
  • Field 1: Label - 0 for "negative" - this candidate is not a PE; >0 for "positive" - this candidate is a PE
  • Field 2+: Additional features

The testing data file will be in the same format, except that Field 1 (the label) will be "-1", denoting "unknown".
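A minimal loader for this format might look as follows; this is a sketch only, and the function name and return structure are not part of the official distribution.

def load_candidates(path):
    patient_ids, labels, features = [], [], []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            patient_ids.append(int(fields[0]))                 # Field 0: patient ID
            labels.append(int(float(fields[1])))               # Field 1: label (-1 in the test file)
            features.append([float(v) for v in fields[2:]])    # Field 2+: features
    return patient_ids, labels, features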

Results

Winners of KDD Cup 2006: Task 1 - PE Identification
  • First Place: Robert Bell, Patrick Haffner, and Chris Volinsky (AT & T Research)
  • First Runner Up: Dmitriy Fradkin (Ask.com)
  • Second Runner Up: Domonkos Tikk (Budapest University of Technology & Economics), Zsolt T. Kardkovacs (Budapest University of Technology & Economics), Ferenc P. Szidarovszky (Szidarovszky Ltd. and Budapest University of Technology & Economics), Gyorgy Biro (TextMiner Ltd.), and Zoltan Balint (Budapest University of Technology & Economics)
  • Best Student Entry: Karthik Kumara (team leader), Sourangshu Bhattacharya, Mehul Parsana, Shivramkrishnan K, Rashmin Babaria, Saketha Nath J, and Chiranjib Bhattacharyya (Indian Institute of Science)

Winners of KDD Cup 2006: Task 2 - Patient Classification
  • First Place: Domonkos Tikk (Budapest University of Technology & Economics), Zsolt T. Kardkovacs (Budapest University of Technology & Economics), Ferenc P. Szidarovszky (Szidarovszky Ltd. and Budapest University of Technology & Economics), Gyorgy Biro (TextMiner Ltd.), and Zoltan Balint (Budapest University of Technology & Economics)
  • First Runner Up: Ruiping Wang, Yu Su, Ting Liu, Fei Yang, Liangguo Zhang, Dong Zhang, Shiguang Shan, Weiqiang Wang, Ruixiang Sun, and Wen Gao (Institute of Computing Technology, Chinese Academy of Sciences)
  • Second Runner Up: Cas Zhang, Y. Zhou, Q. Wang, and H. Ge (Joint R & D Lab, Chinese Academy of Sciences)
  • Third Runner Up: Dmitriy Fradkin (Ask.com)
  • Best Student Entry: Zhang Cas (IA, PKU)

Winners of KDD Cup 2006: Task 3 - Negative Predictive Value
  • First Place: William Perrizo and Amal Shehan Perera (DataSURG Group, North Dakota State University)
  • Runner Up: Nimisha Gupta and Tarun Agarwal (Strand Life Sciences Pvt. Ltd.)
  • Best Student Entry: Karthik Kumara (team leader), Sourangshu Bhattacharya, Mehul Parsana, Shivramkrishnan K, Rashmin Babaria, Saketha Nath J, and Chiranjib Bhattacharyya (Indian Institute of Science)

Final Results: Task 1 - PE Identification

Results are listed using the Download ID for privacy.

Download ID T1a Mean sensitivity T1b Mean sensitivity T1c Mean sensitivity Final score
ef76e0193ef8372113a26b9be97368d3 1.17 1.31 1.58 1.35
d0b226d5cb7acf1960c3c366bba827f6 1.00 1.25 1.58 1.28
2b33b8ba98ec223548eab4e64ae88dc4 0.93 1.36 1.51 1.27
8e3126f2523ecc6acbc87bff0e3d49c4 0.97 1.34 1.40 1.24
f0edf80fda6fbba3e06b24eb9b3df1fc 1.08 1.25 1.38 1.24
9b5ae2264450b107c5ee952ec5a72eeb 1.04 1.20 1.39 1.21
3ac232f02b893cb57a4487c4a5330ea4 0.90 1.24 1.43 1.19
4b27bd0bf34dc747b0211623c2eb65a6 0.75 1.15 1.55 1.15
2781ba05576561b182e2809ee9768ccb 0.84 1.12 1.47 1.14
c671036b3aafa55d0dde37175e7fca96 0.83 1.06 1.52 1.14
bf8f0121ad3f7f3b397310b9a576f81f 0.72 0.99 1.45 1.05
77617854db97705991f2e10fab7f09c2 0.90 1.02 1.21 1.04
4d79702fce261f0344b32c4a089f7882 0.99 0.88 1.20 1.02
92bbb197efdf5ea61ddf64bdf55e13a6 0.76 1.02 1.26 1.01
ec776e4d8991ab9a3772a2422481f27f 0.58 1.10 1.35 1.01
c6ac5e1d49084ec5e373b12bc38583c6 0.71 0.94 1.26 0.97
0d36b6b56a1d3605b3f7ee304faa9993 0.70 0.94 1.25 0.96
572de68bda39fccd166160a0d0e7e599 0.81 0.82 1.10 0.91
32074f9c7d8ac756ef5c77d4ae50d849 0.65 0.74 1.07 0.82
17364b8e5ccb947dd0bf56c9c2c90f5e 0.57 0.83 0.98 0.79
769fb06eceb15b7282dc9122b72add00 0.69 0.78 0.82 0.76
7481dec4886cc40f2bd34a1ae8d89d5f 0.61 0.60 0.71 0.64
c17cbd4bddbc415337fdc68752dd9436 0.48 0.48 0.48 0.48
f2d2a194dd9c97589d2b1c56d07db28f 0.37 0.41 0.56 0.45
ddb0fceec6f0e804b4ac5ed3fa65da4d 0.36 0.42 0.53 0.44
656708047c26984c5bc03e7450ebd90b 0.36 0.42 0.49 0.42
64e5a8252b98e4769a7b37ec946922cf 0.27 0.34 0.39 0.33
ddfbe24b943c4889ead6ca1d62011ff2 0.18 0.24 0.28 0.23
90edf490e4c28d1b3b38256fe617e3cd 0.00 0.12 0.56 0.23
65c983aba5d5c6b5e0bc765df792768e 0.18 0.21 0.28 0.22
05dd608fccd3b3597517c08e71ad5aee 0.18 0.22 0.23 0.21
81325a4d88994b5966af400b0f97a4eb 0.10 0.14 0.16 0.14
ddbfd582abf308ccbf1ee9c0501ebd13 0.08 0.11 0.12 0.10
c602ab8a2dd48089d1e371b6d83c613e 0.06 0.06 0.06 0.06
a3c34f54cf332e8b3e77b0f000aab967 0.03 0.03 0.03 0.03
da6f41f17af96a64ebc21ce87a4b0799 0.01 0.01 0.01 0.01
e55a770983977c02fd5113c244a5addc 0.00 0.00 0.00 0.00
e38fab690e3ff7a2f39198ca3eeb0b71 0.00 0.00 0.00 0.00
d62220e5b3f14aa9a3c723f0c94faa84 0.00 0.00 0.00 0.00
ffa56cf7cc759edb75dd225a9d75b8ef 0.00 0.00 0.00 0.00
67b82ccc8b29b330f180dcdf82f72fcd 0.00 0.00 0.00 0.00
9f0dd861b210f17dc03340243e708ddf 0.00 0.00 0.00 0.00
b4e3aa99276f8c677d4d7dea4dc1f6b5 0.00 0.00 0.00 0.00
7f10d1b1fdb510968190a701441fdee2 0.00 0.00 0.00 0.00
71cac307edab1362183242d1dd37f5da 0.00 0.00 0.00 0.00
55fdd5d6670559625e697f8b43b4df25 0.00 0.00 0.00 0.00
416dcd50fda8f5aebfd2a28211258203 0.00 0.00 0.00 0.00
82a93157f65f7f213f8df789392a2bc2 0.00 0.00 0.00 0.00
a1d33a3898c587ae85476d7711dd8652 0.00 0.00 0.00 0.00
7f9df23b03c5556ca06989aca584f5e7 0.00 0.00 0.00 0.00
5b11eeaa3abd2fbe27a8782c7ac95282 0.00 0.00 0.00 0.00
dc222863b2fa5486325c403e99551f30 0.00 0.00 0.00 0.00
472841ed288953007171fea8c977d711 0.00 0.00 0.00 0.00
3c5362554b598ca99371d8f9520b61fa 0.00 0.00 0.00 0.00
27fb957db3aca93091b2dfde22bb0d48 0.00 0.00 0.00 0.00
6ea66b3f045715c0711d40503f69ee37 0.00 0.00 0.00 0.00
e1fffae6ba2eb5db5b3a752019ee7a25 0.00 0.00 0.00 0.00
c979a7d590b53d5fd6e1b57a8c7028fe 0.00 0.00 0.00 0.00
c9cd5a32ed1c924e696d49c4739eb042 0.00 0.00 0.00 0.00
40a5e5f32863b93a9749759e46e8287b 0.00 0.00 0.00 0.00
b10608c680dce2e340b7d91cc773e5db 0.00 0.00 0.00 0.00
0446ffe9133982b27ae939c97c303713 0.00 0.00 0.00 0.00
71076e0b39f6ee18cb6415e4d6e79d11 0.00 0.00 0.00 0.00
7c9ca4fea7553d210f421665b7fd1fb2 0.00 0.00 0.00 0.00

End of Table
Final Results: Task 2 - Patient Classification

Results are listed using the Download ID for privacy.

Download ID T2a Mean sensitivity T2b Mean sensitivity T2c Mean sensitivity Final score
2b33b8ba98ec223548eab4e64ae88dc4 11.50 14.34 14.90 13.58
3ac232f02b893cb57a4487c4a5330ea4 11.56 13.74 15.39 13.56
656708047c26984c5bc03e7450ebd90b 11.18 13.74 15.39 13.44
d0b226d5cb7acf1960c3c366bba827f6 12.02 13.74 14.33 13.36
8e3126f2523ecc6acbc87bff0e3d49c4 10.98 13.99 14.07 13.01
c671036b3aafa55d0dde37175e7fca96 11.11 12.48 14.46 12.68
3c5362554b598ca99371d8f9520b61fa 11.78 12.37 13.15 12.43
b4e3aa99276f8c677d4d7dea4dc1f6b5 11.19 11.95 13.97 12.37
2781ba05576561b182e2809ee9768ccb 10.68 12.24 13.94 12.28
92bbb197efdf5ea61ddf64bdf55e13a6 10.79 12.55 13.45 12.26
4b27bd0bf34dc747b0211623c2eb65a6 8.79 12.58 14.07 11.81
4d79702fce261f0344b32c4a089f7882 11.90 10.13 12.94 11.65
67b82ccc8b29b330f180dcdf82f72fcd 9.25 12.49 12.49 11.41
ec776e4d8991ab9a3772a2422481f27f 8.65 11.27 14.04 11.32
0d36b6b56a1d3605b3f7ee304faa9993 10.21 10.45 12.13 10.93
572de68bda39fccd166160a0d0e7e599 10.15 9.79 12.63 10.86
77617854db97705991f2e10fab7f09c2 9.98 10.54 11.37 10.63
769fb06eceb15b7282dc9122b72add00 9.66 10.91 11.32 10.63
f0edf80fda6fbba3e06b24eb9b3df1fc 8.05 10.89 12.38 10.44
7481dec4886cc40f2bd34a1ae8d89d5f 10.56 8.22 9.65 9.48
32074f9c7d8ac756ef5c77d4ae50d849 8.37 9.22 10.63 9.41
17364b8e5ccb947dd0bf56c9c2c90f5e 7.14 8.93 9.26 8.44
c17cbd4bddbc415337fdc68752dd9436 8.27 8.27 8.27 8.27
da6f41f17af96a64ebc21ce87a4b0799 7.48 7.65 9.22 8.12
81325a4d88994b5966af400b0f97a4eb 6.35 7.51 8.48 7.44
65c983aba5d5c6b5e0bc765df792768e 6.91 7.02 7.43 7.12
9b5ae2264450b107c5ee952ec5a72eeb 5.07 5.96 7.01 6.01
ef76e0193ef8372113a26b9be97368d3 5.18 5.47 6.74 5.79
ddfbe24b943c4889ead6ca1d62011ff2 5.15 5.63 6.28 5.68
f2d2a194dd9c97589d2b1c56d07db28f 5.11 5.33 6.27 5.57
ddb0fceec6f0e804b4ac5ed3fa65da4d 4.55 5.04 5.98 5.19
bf8f0121ad3f7f3b397310b9a576f81f 3.58 5.27 5.54 4.80
90edf490e4c28d1b3b38256fe617e3cd 0.00 2.44 9.29 3.91
ddbfd582abf308ccbf1ee9c0501ebd13 2.98 3.83 4.50 3.77
64e5a8252b98e4769a7b37ec946922cf 2.96 3.63 3.81 3.47
05dd608fccd3b3597517c08e71ad5aee 2.34 2.39 2.44 2.39
c602ab8a2dd48089d1e371b6d83c613e 0.91 0.91 0.91 0.91
c6ac5e1d49084ec5e373b12bc38583c6 0.41 0.46 0.54 0.47
27fb957db3aca93091b2dfde22bb0d48 0.14 0.16 0.20 0.16
a3c34f54cf332e8b3e77b0f000aab967 0.08 0.07 0.07 0.07
e55a770983977c02fd5113c244a5addc 0.00 0.00 0.00 0.00
e38fab690e3ff7a2f39198ca3eeb0b71 0.00 0.00 0.00 0.00
ac80868189c69ec9bec887b92da65312 0.00 0.00 0.00 0.00
d62220e5b3f14aa9a3c723f0c94faa84 0.00 0.00 0.00 0.00
ffa56cf7cc759edb75dd225a9d75b8ef 0.00 0.00 0.00 0.00
9f0dd861b210f17dc03340243e708ddf 0.00 0.00 0.00 0.00
7f10d1b1fdb510968190a701441fdee2 0.00 0.00 0.00 0.00
71cac307edab1362183242d1dd37f5da 0.00 0.00 0.00 0.00
55fdd5d6670559625e697f8b43b4df25 0.00 0.00 0.00 0.00
416dcd50fda8f5aebfd2a28211258203 0.00 0.00 0.00 0.00
82a93157f65f7f213f8df789392a2bc2 0.00 0.00 0.00 0.00
a1d33a3898c587ae85476d7711dd8652 0.00 0.00 0.00 0.00
7f9df23b03c5556ca06989aca584f5e7 0.00 0.00 0.00 0.00
5b11eeaa3abd2fbe27a8782c7ac95282 0.00 0.00 0.00 0.00
dc222863b2fa5486325c403e99551f30 0.00 0.00 0.00 0.00
472841ed288953007171fea8c977d711 0.00 0.00 0.00 0.00
6ea66b3f045715c0711d40503f69ee37 0.00 0.00 0.00 0.00
c979a7d590b53d5fd6e1b57a8c7028fe 0.00 0.00 0.00 0.00
c9cd5a32ed1c924e696d49c4739eb042 0.00 0.00 0.00 0.00
40a5e5f32863b93a9749759e46e8287b 0.00 0.00 0.00 0.00
b10608c680dce2e340b7d91cc773e5db 0.00 0.00 0.00 0.00
0446ffe9133982b27ae939c97c303713 0.00 0.00 0.00 0.00
71076e0b39f6ee18cb6415e4d6e79d11 0.00 0.00 0.00 0.00
7c9ca4fea7553d210f421665b7fd1fb2 0.00 0.00 0.00 0.00

End of Table
Final Results: Task 3 - Negative Predictive Value

Results are listed using the Download ID for privacy.

Download ID T3 NPV T3 TN rate T3 Sensitivity T3 FP rate
7c9ca4fea7553d210f421665b7fd1fb2 0.41 0.91 0.24 7.60
4d79702fce261f0344b32c4a089f7882 0.42 0.47 0.52 3.30
da6f41f17af96a64ebc21ce87a4b0799 0.17 0.35 0.21 2.65
769fb06eceb15b7282dc9122b72add00 0.11 0.21 0.14 12.83
ddfbe24b943c4889ead6ca1d62011ff2 0.10 0.20 0.15 7.04
8e3126f2523ecc6acbc87bff0e3d49c4 0.15 0.16 0.20 2.53
55fdd5d6670559625e697f8b43b4df25 0.10 0.10 0.12 5.40
5b11eeaa3abd2fbe27a8782c7ac95282 0.06 0.07 0.07 5.29
7f10d1b1fdb510968190a701441fdee2 0.07 0.07 0.06 4.35
ddbfd582abf308ccbf1ee9c0501ebd13 0.06 0.06 0.09 14.76
65c983aba5d5c6b5e0bc765df792768e 0.02 0.05 0.01 1.76
ffa56cf7cc759edb75dd225a9d75b8ef 0.04 0.04 0.03 11.39
c64787fd5316495ea2093174649aa421 0.03 0.03 0.03 6.69
17364b8e5ccb947dd0bf56c9c2c90f5e 0.03 0.03 0.03 3.31
d62220e5b3f14aa9a3c723f0c94faa84 0.02 0.02 0.01 17.73
2b33b8ba98ec223548eab4e64ae88dc4 0.01 0.02 0.01 5.96
f0edf80fda6fbba3e06b24eb9b3df1fc 0.01 0.02 0.01 5.50
472841ed288953007171fea8c977d711 0.01 0.02 0.00 15.24
3ac232f02b893cb57a4487c4a5330ea4 0.01 0.01 0.01 12.55
c979a7d590b53d5fd6e1b57a8c7028fe 0.01 0.01 0.01 47.80
64e5a8252b98e4769a7b37ec946922cf 0.01 0.01 0.01 8.39
a1d33a3898c587ae85476d7711dd8652 0.01 0.01 0.01 50.85
dc222863b2fa5486325c403e99551f30 0.01 0.01 0.01 6.18
6ea66b3f045715c0711d40503f69ee37 0.00 0.00 0.00 27.94
32074f9c7d8ac756ef5c77d4ae50d849 0.00 0.00 0.00 26.84
c6ac5e1d49084ec5e373b12bc38583c6 0.00 0.00 0.00 39.46
7f9df23b03c5556ca06989aca584f5e7 0.00 0.00 0.00 20.58
92bbb197efdf5ea61ddf64bdf55e13a6 0.00 0.00 0.00 39.58
6055901d71789196e727117c7e70426d 0.00 0.00 0.00 34.28
9b5ae2264450b107c5ee952ec5a72eeb 0.00 0.00 0.00 39.76
c9cd5a32ed1c924e696d49c4739eb042 0.00 0.00 0.00 20.27
c8b7e5bf28378975589f3748fbc9819f 0.00 0.00 0.00 46.12
656708047c26984c5bc03e7450ebd90b 0.00 0.00 0.00 11.66
0446ffe9133982b27ae939c97c303713 0.00 0.00 0.00 13.97
90edf490e4c28d1b3b38256fe617e3cd 0.00 0.00 0.00 19.00
ef76e0193ef8372113a26b9be97368d3 0.00 0.00 0.00 9.11
c602ab8a2dd48089d1e371b6d83c613e 0.00 0.00 0.00 7.14
ec776e4d8991ab9a3772a2422481f27f 0.00 0.00 0.00 5.32
b4e3aa99276f8c677d4d7dea4dc1f6b5 0.00 0.00 0.00 4.35
71076e0b39f6ee18cb6415e4d6e79d11 0.00 0.00 0.00 4.34
416dcd50fda8f5aebfd2a28211258203 0.00 0.00 0.00 3.60
e38fab690e3ff7a2f39198ca3eeb0b71 0.00 0.00 0.00 3.74
b10608c680dce2e340b7d91cc773e5db 0.00 0.00 0.00 3.00
ddb0fceec6f0e804b4ac5ed3fa65da4d 0.00 0.00 0.00 2.20
a3c34f54cf332e8b3e77b0f000aab967 0.00 0.00 0.00 2.16
67b82ccc8b29b330f180dcdf82f72fcd 0.00 0.00 0.00 1.57
05dd608fccd3b3597517c08e71ad5aee 0.00 0.00 0.00 1.79
bf8f0121ad3f7f3b397310b9a576f81f 0.00 0.00 0.00 1.43
d0b226d5cb7acf1960c3c366bba827f6 0.00 0.00 0.00 1.32
81325a4d88994b5966af400b0f97a4eb 0.00 0.00 0.00 1.26
572de68bda39fccd166160a0d0e7e599 0.00 0.00 0.00 1.47
7481dec4886cc40f2bd34a1ae8d89d5f 0.00 0.00 0.00 1.12
0d36b6b56a1d3605b3f7ee304faa9993 0.00 0.00 0.00 0.88
c671036b3aafa55d0dde37175e7fca96 0.00 0.00 0.00 0.90
f2d2a194dd9c97589d2b1c56d07db28f 0.00 0.00 0.00 1.05
c17cbd4bddbc415337fdc68752dd9436 0.00 0.00 0.00 0.51
 

End of Table

Frequently Asked Questions

The spec says that there are 46 positive cases and 20 negative cases, but I only count 38 positive and 8 negative in the training data. What gives?

Don't panic! As the README included with the data indicates, this release is a preliminary data set - it may not match the spec in all ways. The slightly longer story is that Siemens is preparing a more extensive version of the training data. But it turned out to take longer to produce that data than they had expected, and we judged that it was more important to go forward with the competition than to wait on the final data. So we released the data that is currently available. The spec is based on Siemens's projections of what the final data will look like and is out of synch with the preliminary data that we released. The "final" data may or may not become available in time to be of use during the competition; if it does become available I will release it as soon as I can. At that point, I'll correct the spec to reflect the final status of the training data.

How do you define candidates which are associated with PE (see pdf-specifications file)?

The 'label' field actually gives the index of the PE with which the candidate is associated. That is, label=0 means 'this candidate is not associated with a PE', while label=X means 'this candidate is associated with PE number X' (for X != 0).

Why does the 'KDDPEfeatureNames.txt' file have fewer feature names (116, including Patient ID) than the 'KDDPETrain.txt' file fields (117)?

The last feature name should be Tissue Feature and is missing from the original feature names file.

What's with normalization? The contest documentation says "all features are normalized from 0 to 1," but there are negative numbers all over the place.

Yes, that's a mistake in the documentation. The normalization is into a unit range and roughly a zero mean. But feel free to re-normalize the data in any way that you like.

How many submissions (i.e. sets of predictions) are allowed for each subtask? What should be the format for the submissions?

I'm working on sorting out the submission process now. I'll let you all know more when I do.

Once the evaluation is done, will the labels of the test data be released to the participants?

Yes. Once the competition is complete, the intention is to release the entire data set.

Can you give us more info on the meaning of each of the features? Background info, interpretation, and so on?

Siemens is unwilling to provide this information. They are hoping for "abstract" approaches that aren't engineered to specific feature sets. (E.g., so that they're easily extensible to different medical problems.)

How far apart should candidates be before I can reasonably assume they are independent (there is no way I can tell this from the processed data)? I.e. how far apart should they be to represent a "different" part of the lung and not be looking at essentially the same structure?

That's a good question. It's part of the challenge for you to determine that. No prior domain knowledge is available on this point.

There appear to be strong correlations within a patient. Can you explain that? Will we be given patient ID in the test data?

There are indeed strong correlations within patient and even within patient groups from a single hospital. Part of the challenge is to find useful ways to exploit that information. You will be given patient ID in the final test data, but you will not be given a hospital ID and you may not assume that all of the patients are drawn from the same hospital. The test data patient set will be disjoint from the training data patient set.

Can I return "unknown" labels for some of the data points?

I'll say more about the answer format in the future but the short answer is: no. You must commit to an answer for each candidate. In practice, if your algorithm was embedded in a piece of medical imaging equipment, you would be required to return an absolute yes/no answer to the physician - there is no room for "I don't know".

Is there label noise in the training or test data?

Answer from Siemens: "We have high confidence that the labels given in the training and test data are correct. They have been rigorously examined individually by domain experts. But there is always an element of human error - there may be small errors in either direction, but we believe that such errors are minimal, at best. There exists the possibility that there are PEs in the original data to which no candidates were assigned at all, but this is beyond the scope of this competition. (Such PEs will have no candidates in the test data, so you won't be scored on them one way or the other.)"

Contacts

KDD Cup 2006 Chair
  • Terran Lane
    University of New Mexico

Please visit the original KDD Cup 2006 website for more information.

KDD Cup 2005: Internet user search query categorization

This Year's Challenge

This year's competition is about classifying internet user search queries. The task was specifically designed to draw participation from industry, academia, and students.

Task Description

This year's competition focuses on query categorization.

Your task is to categorize 800,000 queries into 67 predefined categories. The meaning and intention of search queries is subjective. A search query "Saturn" might mean the Saturn car to some people and the planet Saturn to others. We will use multiple human editors to classify a subset of queries selected from the total set given to you. The collection of human editors is assumed to have more complete knowledge of the internet than any individual end user. A portion of the editor-labeled queries is given to you (CategorizedQuerySample.txt in the zip file for downloading) and the rest will be held back for evaluation. You will not know which queries will be used for evaluation, so you are asked to categorize all of the queries given.

You should tag each query with up to 5 categories. If the submission does not contain all search queries, those not included will be treated as having no category tags.


Evaluation

The evaluation will run on the held-back queries and rank your results by how closely they match the results from the human editors. Submissions will be scored on precision and F1 against the editor labels; these measures are used for winner selection as described under Competition Rules below.

You will be asked to submit your algorithms. The interestingness, scalability, and efficiency of the algorithms will also be judged. New ideas in handling search queries and internet content will be valued and most innovative ideas will be selected by KDD Cup co-chairs.
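For intuition, here is a hedged sketch of set-based precision and F1 over (query, category) tags. The official KDD Cup 2005 formulas (averaged over the human editors) were given on the original page and may differ in detail, so treat this only as an approximation.

def precision_recall_f1(predicted_tags, editor_tags):
    # predicted_tags / editor_tags: dicts mapping a query to its set of category tags.
    correct = sum(len(predicted_tags.get(q, set()) & tags) for q, tags in editor_tags.items())
    n_predicted = sum(len(predicted_tags.get(q, set())) for q in editor_tags)
    n_true = sum(len(tags) for tags in editor_tags.values())
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1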

Competition Rules

Agreement: By sending the registration email, you indicate your full and unconditional agreement and acceptance of these contest rules.

Eligibility: The contest is open to any party planning to attend KDD 2005. A person can participate in only one group. Multiple submissions per group are allowed, since we will not provide feedback at the time of submission. Only the last submission before the deadline will be evaluated and all other submissions will be discarded.

Integrity: The contestant takes the responsibility of obtaining any permission to use any algorithms/tools/data that are intellectual property of third party.

Winner Selection: There will be three prizes: the "Query Categorization Precision Award", the "Query Categorization Performance Award", and the "Query Categorization Creativity Award". One winner will be selected for each award.

The winners will be determined as follows. All participants are ranked according to their overall performance (F1) and average precision on the test set. Participants will also be ranked based on the creativity of their methodologies.

The winner of the "Query Categorization Performance Award" is the participant with the best average performance rank in terms of F1.

Among the participants with the top 10 F1 scores, the winner of the "Query Categorization Precision Award" is the one with the best average precision.

The winner of the "Query Categorization Creativity Award" is the participant whose model has a top-20 average rank in terms of F1 and whose ideas are judged most creative by the KDD Cup co-chairs and a group of search experts. The scalability and level of automation of the model will also be considered in the judgment.

An honorable mention will be awarded for each prize.

Data

MISSING?

Results

Winners
  • Query Categorization Precision Award
    Hong Kong University of Science and Technology team
    Dou Shen, Rong Pan, Jiantao Sun, Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang
  • Query Categorization Performance Award
    Hong Kong University of Science and Technology team
    Dou Shen, Rong Pan, Jiantao Sun, Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang
  • Query Categorization Creativity Award
    Hong Kong University of Science and Technology team
    Dou Shen, Rong Pan, Jiantao Sun, Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang

Runner-ups
  • Query Categorization Precision Award
    Budapest University of Technology team
    Zsolt T. Kardkovacs, Domonkos Tikk, Zoltan Bansaghi
  • Query Categorization Performance Award
    MEDai/AI Insight/ Humboldt University team
    David S. Vogel, Steve Bridges, Steffen Bickel, Peter Haider, Rolf Schimpfky, Peter Siemen, Tobias Scheffer
  • Query Categorization Creativity Award
    Budapest University of Technology team
    Zsolt T. Kardkovacs, Domonkos Tikk, Zoltan Bansaghi
Full Results

The following table contains the evaluation results for the submitted solutions we received. The solutions are listed in random order. The organizer will send the "Submission ID" to each individual participant. Once you receive your "Submission ID" for your solution, you can use it to access your evaluation result in the following table.

 

Submission ID Precision F1
1 0.145099 0.146839
2 0.116583 0.139732
3 0.339435 0.309754
4 0.110885 0.124228
5 0.31068 0.085639
6 0.254815 0.246264
7 0.263953 0.306359
8 0.454068 0.405453
9 0.264312 0.306612
10 0.334048 0.342248
11 0.107045 0.116521
12 0.196117 0.207787
13 0.326408 0.357127
14 0.317308 0.312812
15 0.271791 0.26545
16 0.050918 0.060285
17 0.264009 0.218436
18 0.206167 0.247854
19 0.136541 0.127008
20 0.127784 0.126848
21 0.340883 0.34009
22 0.414067 0.444395
23 0.237661 0.250293
24 0.244565 0.258035
25 0.753659 0.205391
26 0.255726 0.274579
27 0.206919 0.205302
28 0.148503 0.17614
29 0.171081 0.1985
30 0.145467 0.173173
31 0.108305 0.108174
32 0.16962 0.232654
33 0.469353 0.255096
34 0.198284 0.191618
35 0.32075 0.384136
36 0.211284 0.129937
37 0.423741 0.426123

End of Table

Frequently Asked Questions

Can we submit a separate solution for each evaluation criteria (award)?

Yes. When submitting, you can include multiple files in the compressed file, one for each solution. In this case, you need to specify clearly which file is for which award. However, your solution dedicated to precision must be in the top 10 F1 scores in order to become a candidate for the "Query Categorization Precision Award". Also, in the algorithm description, you need to clearly specify which algorithm was used for which award. Note that only one solution is allowed for each of the "Query Categorization Precision Award" and the "Query Categorization Performance Award" from any participant team. The "Query Categorization Creativity Award" will be based on the description of the algorithm(s) used for the other two awards.

Are we allowed to use external sources (e.g. documents from directories) to increase the knowledge of the classifier?

Yes, you will decide what methodology or resource to use to classify the queries. There is no restriction on what data you can/can't use to build your models.

How should trash queries or non-English queries be labeled?

The evaluation set contains only valid English queries. Participants may have their systems return no labels for non-English or trash queries.

Do I have to submit an algorithm description?

You need to submit an algorithm description. If you do not want to share the details of your techniques, you can just give a high level description of your approach and please indicate "a brief summary" at the beginning of your description. In this case, you will not participate in "Query Categorization Creativity Award".

How will the evaluation query set be selected?

1. The queries for evaluation will be selected randomly. 
2. Foreign language queries / trash queries / improper content queries will be dropped from the evaluation set during the selection process.

Contacts

KDD Cup 2005 Chairs
  • Ying Li
    Microsoft Corp.
  • Zijian Zheng
    Amazon.com

Please visit the original KDD Cup 2005 website for more information. My Note: Page Not Found

KDD Cup 2004: Particle physics; plus protein homology prediction

This Year's Challenge

This year's competition focuses on data-mining for a variety of performance criteria such as Accuracy, Squared Error, Cross Entropy, and ROC Area. As described on this WWW-site, there are two main tasks based on two datasets from the areas of bioinformatics and quantum physics.

KDD Cup 2004: Software


The PERF Software

UPDATE 5th of July: One of the participants pointed out that on some platforms, PERF might underestimate APR by a small amount when there are certain numbers of ties. Most likely, this will not affect your results. However, we have put a new, more robust version on the KDD-Cup web page. The new version is 5.11. Just in case you want to be absolutely certain, you might want to use the new code. Note again that this only affects APR on the protein problem. However, if you are still using a version of PERF prior to 5.10, you should download and use the new version. See the FAQ, or the README file in the download, for more detail.

We will use the program perf to measure the performance of the predictions you submit on the eight performance metrics. You do not need to use perf, but using perf will ensure that the metrics you optimize are defined the same way we will measure them.

Perf calculates a variety of performance metrics suitable for boolean classification problems. Metrics include: accuracy, root-mean-squared-error, cross-entropy, precision, recall, precision/recall break-even point and F-score, area under the ROC curve, lift, weighted cost, top 1, top 10, rank of lowest positive case, q-score, several measures of probability calibration, etc.

Perf is available for download already compiled for a number of platforms:

Each directory contains a subdirectory with sample test data and sample output. You can use this to test that perf works on your platform, and to see what the input format to perf is.

We recently made a number of changes to perf for the KDD-CUP, so please let us know if you find bugs.

Perf is a stand-alone C-program that is easy to compile on different platforms. Here is the source code and makefile:

Perf can read from the standard input, or from files containing target values and predicted values. When reading from the standard input, the input to perf is a series of lines, each of which contains a target value and a predicted value separated by whitespace. Perf reads the entire input corresponding to the targets and predictions for a train or test set, and then calculates the performance measures you request. Here is a short example of the input. The first column is the target class (0 or 1). The second column is the probability the model predicts for the case being in class 1. The input format allows any kind of whitespace to separate the two columns (e.g. spaces, tabs, commas).

1 0.80962
0 0.48458
1 0.65812
0 0.16117
0 0.47375
0 0.26587
1 0.71517
1 0.63866
0 0.36296
1 0.89639
0 0.35936
0 0.22413
0 0.36402
1 0.41459
1 0.83148
0 0.23271
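If you prefer to drive the scoring from a script, a minimal sketch along these lines should work, assuming the compiled perf binary described above (not the Linux profiler of the same name) is on your PATH; the helper name is illustrative.

import subprocess

def score_with_perf(targets, predictions, options=("-roc",)):
    # Build the whitespace-separated "target prediction" lines perf expects on stdin.
    lines = "".join("%d %f\n" % (t, p) for t, p in zip(targets, predictions))
    result = subprocess.run(["perf", *options], input=lines,
                            capture_output=True, text=True, check=True)
    return result.stdout

# Example usage:
# print(score_with_perf([1, 0, 1, 0], [0.81, 0.48, 0.66, 0.16]))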

The software can calculate a variety of performance measures, but you won't need most of them for the competition. For the KDD-CUP 2004 competition the performance measures we are interested in are:

  • For the Particle Physics Problem:
    • ACC: accuracy
    • ROC: area under the ROC curve (aka AUC)
    • CXE: cross-entropy
    • SLQ 0.01: Stanford Linear Accelerator Q-score (more on this later)
  • For the Protein Matching Problem:
    • TOP1: how often is a correct match (a homolog) ranked first
    • RMS: root-mean-squared-error (similar to optimizing squared error)
    • RKL: rank of the last matching case (rank of the last positive case)
    • APR: average precision

If you specify no options, perf prints a variety of performance measures. Typically you will specify options so that perf only calculates the performance metric(s) you are interested in, but here is sample output of perf when run on one of the test data sets included in the perf_sample_test_data directory with no options specified:

[caruana] perf < testperfdata

ACC 0.83292 pred_thresh 0.500000
PPV 0.35294 pred_thresh 0.500000
NPV 0.96203 pred_thresh 0.500000
SEN 0.71429 pred_thresh 0.500000
SPC 0.84680 pred_thresh 0.500000
PRE 0.35294 pred_thresh 0.500000
REC 0.71429 pred_thresh 0.500000
PRF 0.47244 pred_thresh 0.500000
LFT 3.36975 pred_thresh 0.500000
SAR 0.78902 pred_thresh 0.500000
wacc 1.000000 wroc 1.000000 wrms 1.000000

ACC 0.90524 freq_thresh 0.617802
PPV 0.54762 freq_thresh 0.617802
NPV 0.94708 freq_thresh 0.617802
SEN 0.54762 freq_thresh 0.617802
SPC 0.94708 freq_thresh 0.617802
PRE 0.54762 freq_thresh 0.617802
REC 0.54762 freq_thresh 0.617802
PRF 0.54762 freq_thresh 0.617802
LFT 5.22846 freq_thresh 0.617802
SAR 0.81313 freq_thresh 0.617802
wacc 1.000000 wroc 1.000000 wrms 1.000000

ACC 0.91521 max_acc_thresh 0.712250
PPV 0.68182 max_acc_thresh 0.712250
NPV 0.92876 max_acc_thresh 0.712250
SEN 0.35714 max_acc_thresh 0.712250
SPC 0.98050 max_acc_thresh 0.712250
PRE 0.68182 max_acc_thresh 0.712250
REC 0.35714 max_acc_thresh 0.712250
PRF 0.46875 max_acc_thresh 0.712250
LFT 6.50974 max_acc_thresh 0.712250
SAR 0.81645 max_acc_thresh 0.712250
wacc 1.000000 wroc 1.000000 wrms 1.000000

PRB 0.54762
APR 0.51425
ROC 0.88380
R50 0.49954
RKL 273
TOP1 1.00000
TOP10 1.00000
SLQ 0.80851 Bin_Width 0.010000
RMS 0.34966
CXE 0.57335
CA1 0.22115 19_0.05_bins
CA2 0.22962 Bin_Size 100

To make the output simpler, you can specify only the measure(s) you are interested in. For example, to compute just the ROC Area or just the average precision:

[caruana] perf -roc < testperfdata

ROC 0.88380

[caruana] perf -apr < testperfdata

APR 0.51425

To compute the accuracy, cross-entropy, and root-mean-squared-error:

[caruana] perf -acc -cxe -rms < testperfdata

ACC 0.83292 pred_thresh 0.500000
RMS 0.34966
CXE 0.57335

Note that accuracy needed a threshold and perf used a default threshold of 0.5. If you want to use a different threshold (e.g. a threshold of 0 when using SVMs), the threshold can be specified with a "-threshold" option:

[caruana] perf -acc -threshold 0.0 -cxe -rms < testperfdata

ACC 0.10474 pred_thresh 0.000000
RMS 0.34966
CXE 0.57335

Note that the threshold changed only the accuracy, but not the RMS or CXE. Predictions below threshold are treated as class 0 and predictions above threshold are treated as class 1. When submitting predictions for the KDD-CUP for accuracy (the only performance measure we are using in the cup that depends on a threshold) you will be asked to submit a threshold as well.

Perf can read from files instead of from the standard input:

[caruana] perf -acc -threshold 0.0 -cxe -rms -file testperfdata

ACC 0.10474 pred_thresh 0.000000
RMS 0.34966
CXE 0.57335

Note that the file option must be the last option specified.

Perf has a variety of other options not described here. It can plot ROC curves and precision-recall plots; automatically select thresholds that maximize accuracy, or that make the frequency of cases predicted positive match the number of positive cases in the data set (both of these should be used to find thresholds on train or validation sets, and then you should specify that threshold with the "-threshold" option when testing on test sets -- finding optimal thresholds directly on test sets usually is a no-no); display confusion matrices; and calculate cost when unequal costs apply to false positives and false negatives. A tutorial on perf is currently being prepared, but you really won't need it for the KDD-CUP. To see a list of perf's options, run perf with an illegal option such as "-help":

[caruana] perf -help

Error: Unrecognized program option -help
Version 5.00 [KDDCUP-2004]

Usage:
./perf [options] < input
OR ./perf [options] -file <input file>
OR ./perf [options] -files <targets file> <predictions file>

Allowed options:

PERFORMANCE MEASURES
-ACC Accuracy
-RMS Root Mean Squared Error
-CXE Mean Cross-Entropy
-ROC ROC area [default, if nothing else selected]
-R50 ROC area up to 50 negative examples
-SEN Sensitivity
-SPC Specificity
-NPV Negative Predictive Value
-PPV Positive Predictive Value
-PRE Precision
-REC Recall
-PRF F1 score
-PRB Precision/Recall Break Even Point
-APR Mean Average Precision
-LFT Lift (at threshold)
-TOP1 Top 1: is the top ranked case positive
-TOP10 Top 10: is there a positive in the top 10 ranked cases
-NTOP <N> How many positives in the top N ranked cases
-RKL Rank of *last* (poorest ranked) positive case
-NRM <arg> Norm error using L metric
-CST <tp> <fn> <fp> <tn>
Total cost using these cost values, plus min-cost results
-SAR <wACC> <wROC> <wRMS>
SAR = (wACC*ACC + wROC*ROC + wRMS*(1-RMS))/(wACC+wROC+wRMS)
typically wACC = wROC = wRMS = 1.0
-CAL <bin_size> CA1/CA2 scores
-SLQ <bin_width> Slac Q-score

METRIC GROUPS
-all display most metrics (the default if no options are specified)
-easy ROC, ACC and RMS
-stats Accuracy, confusion table metrics, lift
-confusion Confusion table plus all metrics in stat

PLOTS (Only one plot is drawn at a time)
-plot roc Draw ROC plot
-plot pr Draw Precision/Recall plot
-plot lift Draw Lift versus threshold plot
-plot cost Draw Cost versus threshold plot
-plot acc Draw Accuracy versus threshold plot

PARAMETERS
-t <arg> Set threshold [default threshold is 0.5 if not set]
-percent <arg> Set threshold so <arg> percent of data falls above threshold (predicted positive)

INPUT
-blocks Input has BLOCK ID numbers in first column. Calculate performance for each block and report the mean performance across the blocks. Only works with APR, TOP1, RKL, and RMS.
If using separate files for target and predictions input (-file option), the BLOCK ID numbers must be the first column of the target file, with no block numbers in the predictions file.

-file <file> Read input from one file (1st col targets, 2nd col predictions)
-files <targets file> <predictions file> Read input from two separate files

Tasks

Task Description

Introduction

This year's competition focuses on various performance measures for supervised classification. Real world applications of data-mining typically require optimizing for non-standard performance measures that depend on the application. For example, in direct marketing, error rate is a bad indicator of performance, since there is a strong imbalance in cost between missing a customer and making the advertisement effort slightly too broad. Even for the same dataset, we often want to have different classification rules that optimize different criteria.

In this year's competition, we ask you to design classification rules that optimize a particular performance measure (e.g. accuracy, ROC area, lift). For each of the two datasets described below, we provide a supervised training set and a test set where we held back the correct labels. For each of the two test sets, you will submit multiple sets of predictions. Each set is supposed to maximize the performance according to a particular measure. We will provide software for computing the performance measures in the software area. The same software will be used to determine the winners of the competition. The particular performance measures are described in the documentation of the software.

Particle Physics Task

The goal in this task is to learn a classification rule that differentiates between two types of particles generated in high energy collider experiments. It is a binary classification problem with 78 attributes. The training set has 50,000 examples, and the test set is of size 100,000. Your task is to provide 4 sets of predictions for the test set that optimize

  • accuracy (maximize)
  • ROC area (maximize)
  • cross entropy (minimize)
  • q-score: a domain-specific performance measure sometimes used in particle physics to measure the effectiveness of prediction models. (maximize: larger q-scores indicate better performance.)

We thank the author of the dataset, who wishes to remain anonymous until after the KDD-Cup.

Software that calculates q-score (and the other seven performance measures) is available from the Software download web page. We'll add a description of q-score to the FAQ web page in the next few days. Until then, just think of it as a black-box performance metric for which you are supposed to train models that maximize this score.

Protein Homology Prediction Task

Additional information about the Protein Homology problem is now available from Ron Elber.

For this task, the goal is to predict which proteins are homologous to a native sequence. The data is grouped into blocks around each native sequence. We provide 153 native sequences as the training set, and 150 native sequences as the test set. For each native sequence, there is a set of approximately 1000 protein sequences for which homology prediction is needed. Homologous sequences are marked as positive examples; non-homologous sequences (also called "decoys") are negative examples.

If you are not familiar with protein matching and structure prediction, it might be helpful to think of this as a WWW search engine problem. There are 153 queries in the train set, and 150 queries in the test set. For each query there are about 1000 returned documents, but only a few of them are correct matches for the query. The goal is to predict which of the 1000 documents best match the query based on 74 attributes that measure the match. A good set of match predictions will rank the "homologous" documents near the top.

Evaluation measures are applied to each block corresponding to a native sequence, and then averaged over all blocks. Most of the measures are rank based and assume that your predictions provide a ranking within each block. Your task is to provide 4 sets of predictions for the test set that optimize

  • fraction of blocks with a homologous sequence ranked top 1 (maximize)
  • average rank of the lowest ranked homologous sequence (minimize)
  • root mean squared error (minimize)
  • average precision (average of the average precision in each block) (maximize)

Note that three of the measures (TOP 1, Average Last Rank, and average precision) depend only on the relative ordering of the matches within each block, not on the predicted values themselves.

We thank the author of the dataset, who wishes to remain anonymous until after the KDD-Cup.


Evaluation

Metrics for the Particle Physics Problem

Accuracy (maximize):

We use the usual definition of accuracy -- the number of cases predicted correctly, divided by the total number of cases. The predictions submitted for accuracy can be boolean or continuous. When submitting predictions, you will be asked for a prediction threshold: predictions above or equal to this threshold will be treated as class 1, predictions below threshold are treated as class 0. An accuracy of 1.00 is perfect prediction. Accuracy near 0.00 is poor.

*** Area Under the ROC Curve (maximize):

We use the usual definition for ROC Area (often abbreviated AUC). An ROC plot is a plot of true positive rate vs. false positive rate as the prediction threshold sweeps through all the possible values. If you are familiar with sensitivity and specificity, this is the same as plotting sensitivity vs. 1-specificity as you sweep the threshold. AUC is the area under this curve. AUC of 1 is perfect prediction -- all positive cases sorted above all negative cases. AUC of 0.5 is random prediction -- there is no relationship between the predicted values and truth. AUC below 0.5 indicates there is a relationship between predicted values and truth, but the model is backwards, i.e., predicts smaller values for positive cases! Another way to think of ROC Area is to imagine sorting the data by predicted values. Suppose this sort is not perfect, i.e., some positive cases sort below some negative cases. AUC effectively measures how many times you would have to swap cases with their neighbors to repair the sort.

AUC = 1.0 - (# of swaps to repair sort)/((# of positives) x (# of negatives))

There's a little more info on AUC in the comments at the top of the perf source code, but you probably don't need to look at it.
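As a sanity check on your own submissions, the pairwise interpretation above can be coded directly. The following is a minimal Python sketch, not the perf implementation: it counts correctly ordered (positive, negative) pairs and gives half credit to ties, in the spirit of the fractional tie handling described in the FAQ below. The function name is ours.

def roc_area(targets, predictions):
    # Fraction of (positive, negative) pairs ordered correctly.
    # Ties between a positive and a negative count as 0.5.
    pos = [p for t, p in zip(targets, predictions) if t == 1]
    neg = [p for t, p in zip(targets, predictions) if t == 0]
    correct = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                correct += 1.0
            elif p_pos == p_neg:
                correct += 0.5
    return correct / (len(pos) * len(neg))   # assumes both classes are present

This quadratic version only illustrates the definition; sorting-based implementations are much faster on 100,000 test cases.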

*** Cross-Entropy (minimize):

We use the usual definition for cross-entropy. Cross-entropy, like squared error, measures how close predicted values are to target values. Unlike squared error, cross-entropy assumes the predicted values are probabilities on the interval [0,1] that indicate the probability that the case is class 1.

Cross-Entropy = - SUM [ (Targ)*log(Pred) + (1-Targ)*log(1-Pred) ]

where Targ is the Target Class (0 or 1) and Pred is the predicted probability that the case is class 1. Note that the terms (Targ) and (1-Targ) are alternately 0 or 1, so log(Pred) contributes to the sum for positive cases and log(1-Pred) contributes for negative cases. Note that cross entropy is infinity if Targ is 0 and Pred is 1, or if Targ is 1 and Pred is 0. In the code we protect against this by just returning a very, very large number (that may be platform dependent). If you get cross-entropies that are very large, you probably have at least one prediction like this.

To make cross-entropy independent of data set size, we use the mean cross-entropy, i.e., the sum of the cross-entropy for each case divided by the total number of cases.
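The mean cross-entropy is straightforward to reproduce. Here is a minimal Python sketch; the natural logarithm and the clipping constant are our assumptions standing in for perf's protection against log(0), not part of the official definition.

import math

def mean_cross_entropy(targets, predictions, eps=1e-12):
    # Mean cross-entropy over all cases; predictions are clipped away
    # from 0 and 1 to avoid the infinities discussed above.
    total = 0.0
    for t, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(targets)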

*** SLAC Q-Score (maximize):

SLQ is a domain-specific performance metric devised by researchers at the Stanford Linear Accelerator Center (SLAC) to measure the performance of predictions made for certain kinds of particle physics problems. SLQ works with models that make continuous predictions on the interval [0,1]. It breaks this interval into a series of bins. For the KDD-CUP we are using 100 equally sized bins. The 100 bins are: 0.00-0.01, 0.01-0.02, ..., 0.98-0.99, 0.99-1.00.

PERF's SLQ option allows you to specify the number of bins. You should use 100 bins for the KDD-CUP competition. For example under unix:

perf -slq 100 < testdata

SLQ places predictions into the bins based on the predicted values. If your model predicts a value of 0.025 for a case, that case will be placed in the third bin 0.02-0.03. In each bin SLQ keeps track of the number of true positives and true negatives. SLQ is maximized if bins have high purity, e.g., if all bins contain all 0's or all 1's. This is unlikely, so SLQ computes a score based on how pure the bins are.

The score is computed as follows: suppose a bin has 350 positives and 150 negatives. The error of this bin if we predict positives (the predominant class) for all cases is 150/(350+150) = 0.30 = err. The contribution of this bin to the total SLQ score is (1-2*err)^2, times the total number of points in this bin divided by the total number of points in the data set. Dividing by the size of the total data set normalizes the score so that it is more independent of the size of the data set. Multiplying by the number of points in the bin gives more weight to bins that have more points, and less weight to bins with fewer points. The term (1-2*err)^2 is maximized (= 1) when all points in that bin are the same class, i.e., when the error due to predicting the predominant class is minimized. This is somewhat similar to measures of node purity in decision trees. The sum over the 100 bins is maximized when the contribution of each bin (weighted by the number of points in each bin) is maximized.

Collaborators at SLAC say that this is an important quantity for their work. They want to find models that maximize it. Other than that, we do not know exactly why the SLQ metric is defined this way. SLQ is a domain-specific performance metric custom designed for this particular class of problems.

Note that SLQ only cares about the purity of each bin. A model would have poor accuracy and an ROC area below 0.5 if you switched the labels used for positives and negatives after training, but SLQ is insensitive to this.
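Since SLQ is unfamiliar to most participants, here is a short Python sketch that follows the description above: 100 equal-width bins over [0,1], each contributing (1 - 2*err)^2 weighted by the fraction of all cases falling into it. The handling of a prediction of exactly 1.0 (placed in the last bin) is our assumption; treat this as a rough check rather than a substitute for perf.

def slq(targets, predictions, bins=100):
    # targets are 0/1 integers, predictions are on [0,1]
    counts = [[0, 0] for _ in range(bins)]        # [negatives, positives] per bin
    n = len(targets)
    for t, p in zip(targets, predictions):
        b = min(int(p * bins), bins - 1)          # e.g. 0.025 falls in bin 0.02-0.03
        counts[b][t] += 1
    score = 0.0
    for neg, pos in counts:
        total = neg + pos
        if total == 0:
            continue
        err = min(neg, pos) / total               # error of predicting the majority class
        score += (1 - 2 * err) ** 2 * total / n   # purity term, weighted by bin size
    return score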

Metrics for the Biology/Protein Problem

Unlike the particle physics problem, the biology problem comes in blocks. Performance is measured on each block independently, and then the average performance across the blocks is used as the final performance metric.

*** TOP1 (maximize):

The fraction of blocks with a true homolog (class 1) predicted as the best match. In each block the predictions are sorted by predicted value. If the case that sorts to the top (highest predicted value) is class 1, you score a 1 for that block. If the top case is class 0, you score 0. (If there are ties for the top case, you score 0 unless all of the tied cases are class 1. This means ties do not help.) The best TOP1 that can be achieved is to predict the highest value for a true homolog in each block. This would yield TOP1 = 1.0, perfect TOP1 prediction. TOP1 = 0.0 is poor.
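The pessimistic tie rule is easy to get wrong, so here is a minimal Python sketch of TOP1 over a set of blocks. The data structure (a dict mapping block id to a list of (target, prediction) pairs) is purely illustrative.

def top1(blocks):
    wins = 0
    for cases in blocks.values():                  # cases: list of (target, prediction)
        best = max(p for _, p in cases)
        tied_targets = [t for t, p in cases if p == best]
        if all(t == 1 for t in tied_targets):      # ties only count if every tied case is class 1
            wins += 1
    return wins / len(blocks)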

*** Average Rank of Last Homolog (minimize):

Like TOP1, cases are sorted by predicted value. This metric measures how far down the sorted cases the last true homolog falls. A robust model would predict high values for all true homologs, so the last homolog would still sort high in the ordered cases. A rank of 1 means that the last homolog sorted to the top position. This is excellent, but can only happen if there is only 1 true homolog in the block. As with TOP1, ties do not help -- if cases are tied perf assumes the true homologs sort to the low end of the ties. It is not possible to achieve an average rank for the last homolog of 1 on this data because most blocks have more than one homolog. Average ranks near 1000 indicate poor performance.
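Below is a sketch of the per-block computation, with the same pessimistic treatment of ties (homologs are assumed to sort to the bottom of any tie group). The final metric is the mean of this rank over all blocks; the code assumes each block contains at least one homolog, which holds for this data.

def rank_of_last_homolog(cases):
    # cases: list of (target, prediction); rank 1 is the top of the sorted list.
    # Sort high predictions first; on ties, negatives come before positives.
    ordered = sorted(cases, key=lambda tp: (-tp[1], tp[0]))
    return max(i + 1 for i, (t, _) in enumerate(ordered) if t == 1)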

*** Root-Mean-Squared-Error (minimize):

We use the usual definition for squared error, then divide it by the number of cases to make it independent of data set size, and then take the square root to convert it back to the natural scale of the targets and predictions.

RMSE = SQRT( 1/N * SUM (Targ-Pred)^2 )

For the KDD-CUP, the targets should be 0 and 1 and the predictions should be on the interval [0,1].
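In Python, the formula above is a direct transcription (the function name is ours):

import math

def rmse(targets, predictions):
    n = len(targets)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n)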

*** Mean Average Precision (maximize):

This is the metric that we got wrong in the first release of PERF. Hopefully you are using the newer release of PERF. One problem with average precision is that there are a variety of definitions in the literature. To add to the confusion, we are using a relatively non-standard definition sometimes called "expected precision". If you are familiar with precision/recall plots, precision is the percentage of points predicted to be positive (above some threshold) that are truly positive. High precision is good. We sweep the threshold so that each case becomes predicted positive one at a time, calculate the precision at each of these thresholds, and then take the average of all of these precisions. This is similar to the area under a precision/recall plot, but not exactly the same.

Another problem with average precision is how to handle ties between positive and negative cases. For the KDD-CUP competition we developed a new method for handling ties that calculates the average precision over all possible orderings of the cases that are tied. Fortunately, we are able to do this without enumerating all possible orderings! Unfortunately, it is still expensive when there are many ties (order n^2 in the number of ties), so the code backs off to a simpler pessimistic calculation when there are thousands of ties. (You shouldn't run into this in the KDD-CUP since each block has only about 1000 cases.)

If you are not familiar with precision, there is a description of it in the comments at the top of the source code for PERF. But you probably don't need to look at it.

Average precision is measured on each block, and then the mean of each block's average precision is used as the final metric. A mean average precision of 1 is perfect prediction. The lowest mean average precision possible is 0.0.
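For orientation, here is a simplified per-block average precision in Python. It follows the common rank-based definition (precision at the rank of each positive, averaged over the positives) and ignores perf's special tie handling and interpolation rule described above, so treat it only as a rough sanity check, not a re-implementation of the official metric.

def average_precision(cases):
    # cases: list of (target, prediction) pairs for one block
    ordered = sorted(cases, key=lambda tp: -tp[1])      # best prediction first
    hits, precisions = 0, []
    for rank, (t, _) in enumerate(ordered, start=1):
        if t == 1:
            hits += 1
            precisions.append(hits / rank)              # precision at this recall level
    return sum(precisions) / len(precisions) if precisions else 0.0

# mean average precision = mean of average_precision over all blocks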

Competition Rules

Agreement: Participation in the contest constitutes the participant's full and unconditional agreement and acceptance of these rules.

Eligibility: The contest is open to any party planning to attend KDD 2004. Each of the tasks will be evaluated separately; you can enter both tasks, only one task, or even only one particular performance measure. A person can participate in only one group per task. Multiple submissions per group are allowed for each task and performance measure, since we will not provide feedback on performance at the time of submission. Only the last submission before the deadline will be evaluated; all other submissions will be discarded.

Integrity: Only the data provided from this website can be used for the contest; use of external data constitutes cheating and is prohibited.

Winner Selection: There will be one winner for the "Quantum Physics Task" and one winner for the "Protein Homology Prediction Task". The winners will be determined according to the following method. All participants are ranked according to their performance on the test set for each task and performance measure. Not submitting a prediction for a performance measure will result in being ranked last for that measure. The winner of a task is the participant with the best average rank over the set of performance measures for that task.

An honorable mention will be awarded for each task and performance measure. It will be given to the participant who is ranked highest for that particular task and performance measure.
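To make the winner-selection procedure concrete, here is a small, purely hypothetical Python sketch of the average-rank computation; the data layout (a dict from measure name to a dict of per-participant ranks) is an illustration, not the organizers' scoring code.

def average_ranks(rank_table, n_participants):
    # rank_table: {measure: {participant: rank}}, rank 1 = best.
    # A participant missing from a measure is ranked last, as the rules state.
    participants = {p for ranks in rank_table.values() for p in ranks}
    return {
        p: sum(m.get(p, n_participants) for m in rank_table.values()) / len(rank_table)
        for p in participants
    }   # the task winner is the participant with the smallest average rank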

Data Download

Download the data here (~63MB). My Note: Downloaded


Data Format

The file you downloaded is a TAR archive that is compressed with GZIP. Most decompression programs (e.g. winzip) can decompress these formats. If you run into problems, send us email. The archive should contain four files:

  • phy_train.dat: Training data for the quantum physics task (50,000 train cases)
  • phy_test.dat: Test data for the quantum physics task (100,000 test cases)
  • bio_train.dat: Training data for the protein homology task (145,751 lines)
  • bio_test.dat: Test data for the protein homology task (139,658 lines)

The file formats for the two tasks are as follows.

Format of the Quantum Physics Dataset

Each line in the training and the test file describes one example. The structure of each line is as follows:

  • The first element of each line is an EXAMPLE ID that uniquely describes the example. You will need this EXAMPLE ID when you submit results.
  • The second element is the class of the example. Positive examples are denoted by 1, negative examples by 0. Test examples have a "?" in this position. This is a balanced problem so the target values are roughly half 0's and 1's.
  • All following elements are feature values. There are 78 feature values in each line.
  • Missing values: columns 22,23,24 and 46,47,48 use a value of "999" to denote "not available", and columns 31 and 57 use "9999" to denote "not available". These are the column numbers in the data tables starting with 1 for the first column (the case ID numbers). If you remove the first two columns (the case ID numbers and the targets), and start numbering the columns at the first attribute, these are attributes 20,21,22, and 44,45,46, and 29 and 55, respectively. You may treat missing values any way you want, including coding them as a unique value, imputing missing values, using learning methods that can handle missing values, ignoring these attributes, etc.

The elements in each line are separated by whitespace.

Format of the Protein Homology Dataset

Each line in the training and the test file describes one example. The structure of each line is as follows:

  • The first element of each line is a BLOCK ID that denotes to which native sequence this example belongs. There is a unique BLOCK ID for each native sequence. BLOCK IDs are integers running from 1 to 303 (one for each native sequence, i.e. for each query). BLOCK IDs were assigned before the blocks were split into the train and test sets, so they do not run consecutively in either file.
  • The second element of each line is an EXAMPLE ID that uniquely describes the example. You will need this EXAMPLE ID and the BLOCK ID when you submit results.
  • The third element is the class of the example. Proteins that are homologous to the native sequence are denoted by 1, non-homologous proteins (i.e. decoys) by 0. Test examples have a "?" in this position.
  • All following elements are feature values. There are 74 feature values in each line. The features describe the match (e.g. the score of a sequence alignment) between the native protein sequence and the sequence that is tested for homology.
  • There are no missing values (that we know of) in the protein data.

To give an example, the first line in bio_train.dat looks like this:

279 261532 0 52.00 32.69 ... -0.350 0.26 0.76

279 is the BLOCK ID. 261532 is the EXAMPLE ID. The "0" in the third column is the target value. This indicates that this protein is not homologous to the native sequence (it is a decoy). If this protein was homologous the target would be "1". Columns 4-77 are the input attributes.

The elements in each line are separated by whitespace.
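Because several of the metrics are computed per block, it is convenient to load the protein files grouped by BLOCK ID. A minimal Python sketch (the function name and return structure are ours):

from collections import defaultdict

def read_bio_file(path):
    # Returns {block_id: [(example_id, target, features), ...]}.
    # The target is None for test lines, which carry "?" in the third column.
    blocks = defaultdict(list)
    with open(path) as f:
        for line in f:
            fields = line.split()
            block_id, example_id, target = fields[0], fields[1], fields[2]
            features = [float(x) for x in fields[3:]]          # the 74 attribute values
            blocks[block_id].append(
                (example_id, None if target == "?" else int(target), features))
    return blocks

For example, read_bio_file("bio_train.dat") should yield 153 blocks of roughly 1000 cases each.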

Results

Winners of KDD Cup 2004: Quantum Physics Problem
  • First Place:
    David S. Vogel, Eric Gottschalk, and Morgan C. Wang
    MEDai / A.I. Insight / University of Central Florida
    • Honorable Mention for ROC Area
    • Honorable Mention for Cross Entropy
    • Honorable Mention for SLQ Score
  • First Runner Up:
    Arpita Chowdhury, Dinesh Bharule, Don Yan, Lalit Wangikar
    Inductis Inc.
    • Honorable Mention for Accuracy
  • Second Runner Up:
    Christophe Lambert
    Golden Helix Inc.

Winners of KDD Cup 2004: Protein Homology Problem
  • First Place:
    Bernhard Pfahringer
    University of Waikato, Computer Science Department
  • Tied for 1st Place Overall:
    Yan Fu, RuiXiang Sun, Qiang Yang, Simin He, Chunli Wang, Haipeng Wang, Shiguang Shan, Junfa Liu, Wen Gao
    Institute of Computing Technology, Chinese Academy of Sciences
    • Honorable Mention for Squared Error
    • Honorable Mention for Average Precision
  • Tied for 1st Place Overall:
    David S. Vogel, Eric Gottschalk, and Morgan C. Wang
    MEDai / A.I. Insight / University of Central Florida
    • Honorable Mention for Top-1 Accuracy
  • Honorable Mention for Rank of Last:
    Dirk Dach, Holger Flick, Christophe Foussette, Marcel Gaspar, Daniel Hakenjos, Felix Jungermann, Christian Kullmann, Anna Litvina, Lars Michele, Katharina Morik, Martin Scholz, Siehyun Strobel, Marc Twiehaus, Nazif Veliu
    Artificial Intelligence Unit, University of Dortmund, Germany

Full Results: Quantum Physics Problem

 

Group Accuracy ROCA Cross Entropy SLQ Score Avg Rank
MEDai/AI Insight 0.73187 0.83054 0.70949 0.33280 1.333
Inductis 0.73255 0.82754 0.72456 0.32648 1.667
Golden Helix 0.72775 0.82250 0.73001 0.31749 3.333
Ahmad abdulKader 0.72744 0.81791 0.72798 0.30982 4.667
Salford Systems 0.72745 0.82109 0.74371 0.31447 5.333
Probing - JL&BZ 0.72522 0.81906 0.73634 0.31142 5.667
Andre and Tiny 0.72424 0.81412 0.74137 0.30217 7.333
FEG, Japan 0.72060 0.81572 0.73583 0.30591 7.667
191 0.72304 0.81252 0.74632 0.30410 9.333
14 0.72332 0.81208 0.75432 - 10.667
Tiberius 0.71884 0.81116 0.75301 0.29569 12.000
CoSCo * 0.71836 0.80952 0.74999 0.29307 13.000
UIUCSFP 0.71883 0.81028 0.75644 0.29441 13.333
159 0.71870 0.80829 0.75015 0.29176 13.667
408 0.71584 0.80699 0.75878 0.28814 18.333
64 0.71833 0.80310 0.75789 - 19.000
415 0.71790 0.80456 0.76745 0.27987 20.000
584 0.71633 0.80419 0.77181 - 22.000
Weka 0.71824 0.80458 0.78607 0.28297 22.333
Rueping * 0.71357 0.80428 0.76196 0.28440 22.667
7 0.71096 0.79646 0.74423 0.28441 24.333
167 0.71359 0.80445 0.78032 0.28332 24.667
347 0.72196 0.80934 1.742E73 0.29295 25.333
362 0.71234 0.80211 0.77623 0.28360 26.667
182 0.71311 0.80642 0.86928 0.28478 27.333
433 0.71141 0.79860 0.77967 - 28.000
27 0.71139 0.79779 0.78481 0.26959 29.667
585 0.70972 0.79748 0.77690 0.27092 29.667
agileumbrella 0.69863 0.79796 0.75798 0.27694 30.000
382 0.70663 0.79420 0.77459 0.26659 30.333
66 0.70588 0.79074 0.76813 0.25839 31.000
3 0.71560 0.80535 8.100E73 0.28437 33.667
586 0.71544 0.80582 9.000E73 0.28626 34.000
UIUCstat 0.70511 0.78783 0.79766 0.25245 35.667
8 0.70428 0.78959 0.79063 0.25722 35.667
117 0.70508 0.79257 0.83395 0.26410 36.000
Claudio Favre 0.71238 0.79303 1.449E73 0.26351 37.333
500 0.70299 0.78413 0.80406 0.24607 37.667
60 0.70577 0.77886 0.85489 0.23758 38.333
Jylin * 0.71832 0.71813 2.535E73 0.19088 39.333
138 0.69766 0.77626 0.83694 0.23265 40.667
Monash SBS 0.68778 0.78269 0.80625 0.23876 41.000
PG445 UniDo * 0.70426 0.77386 0.98868 0.22493 41.667
26 0.71438 0.71429 2.571E73 0.18385 42.333
276 0.69057 0.76975 0.87045 0.22243 44.000
13 0.71304 0.71284 2.583E73 - 44.667
42 0.66814 0.73973 0.85084 0.19766 46.000
Orrego-WVU 0.68649 0.75629 1.14441 0.20091 47.000
jacek 0.69006 0.76322 1.317E73 0.20832 48.000
219 0.69284 0.63626 1.63563 0.25298 49.667
409 0.67311 0.71333 1.18190 0.17636 49.667
153 0.70010 0.69997 2.699E73 0.16020 50.667
WizSoft 0.68107 0.68107 1.656E73 0.15401 52.333
518 0.57431 0.63547 1.14744 - 53.333
148 0.53276 0.54706 1.04035 0.00964 54.333
154 0.58207 0.59276 1.37861 0.08170 54.333
352 0.29094 0.31402 1.00119 - 55.667
187 0.49942 0.50224 1.170E73 0.24698 57.333
HKNN 0.62920 0.62918 3.337E73 0.06680 57.333
142 0.61752 0.63344 3.714E73 - 57.667
264 0.57565 0.52715 3.264E73 0.09919 59.000
206 - - - 0.26032 -
318 0.70858 - - - -
385 0.69455 0.77415 - - -
568 0.64672 - - - -

 
 

End of Table

Full Results: Protein Homology Problem

Group Top 1 RMSE RKL APR Avg Rank
Weka 0.90667 0.03833 52.44667 0.84091 4.125
ICT.AC.CN 0.91333 0.03501 54.08667 0.84118 4.500
MEDai/AI Insight 0.92000 0.03805 53.96000 0.83802 5.500
Mario Ziller * 0.90667 0.03766 55.00667 0.82422 7.125
Rong Pan 0.88667 0.03541 58.85333 0.82459 8.500
Probing - JL&BZ 0.89333 0.03952 52.42000 0.81931 9.125
560 0.88000 0.03826 53.24000 0.81344 11.125
285 0.88667 0.03923 53.30000 0.82066 11.500
PG445 UniDo * 0.86667 0.03878 45.62000 0.82995 11.750
587 0.88667 0.03692 64.58667 0.82006 11.750
206 0.90000 0.03941 59.11333 0.81883 13.000
591 0.87333 0.03830 52.84667 0.79938 14.000
575 0.87333 0.03838 53.06667 0.80187 14.625
584 0.88667 0.04097 61.71333 0.82420 15.500
513 0.87333 0.03848 53.37333 0.80298 16.000
276 0.89333 0.04135 59.80667 0.80672 16.125
541 0.87333 0.03847 53.20000 0.79629 16.750
539 0.86667 0.03850 52.90000 0.79941 16.875
540 0.86667 0.03850 53.26667 0.79560 20.125
14 0.88000 0.04541 68.37333 0.80706 20.875
504 0.88667 0.05182 60.86667 0.79783 21.875
Salford Systems 0.88000 0.03962 96.78667 0.80631 22.875
588 0.88667 0.05436 70.10667 0.80292 23.250
578 0.87333 0.04314 93.02667 0.81902 23.500
382 0.86667 0.04991 68.28667 0.80500 24.750
Martine Cadot 0.86667 0.04499 59.74000 0.79728 24.750
FEG, Japan 0.88000 0.03989 95.72667 0.79569 26.125
561 0.87333 0.03900 79.88667 0.77032 26.250
362 0.86000 0.04284 84.88667 0.80545 26.375
595 0.87333 0.05433 69.39333 0.79895 26.500
182 0.88667 0.09157 101.96667 0.79009 29.750
593 0.72667 0.05182 53.06667 0.64391 31.500
159 0.86000 0.16669 59.67333 0.79622 32.125
98 0.80000 0.05375 58.92667 0.73852 32.125
212 0.78667 0.05023 89.60667 0.71418 36.875
471 0.85333 0.10133 116.20667 0.77871 37.375
154 0.82000 0.15974 74.56000 0.78721 37.500
594 0.84000 0.26759 66.36667 0.76827 37.750
167 0.85333 0.10276 114.85333 0.76495 38.375
590 0.83333 0.13528 77.63333 0.76341 38.875
Pierron/Martino * 0.83333 0.05856 179.98667 0.70717 39.625
500 0.80000 8.062E4 86.94000 0.78668 41.625
3 0.64000 0.20492 75.38000 0.69035 43.000
398 0.82667 0.48079 154.10667 0.76865 43.750
WizSoft 0.78667 0.06701 557.98000 0.64550 44.375
589 0.72667 0.32017 94.35333 0.64391 45.750
544 0.52667 0.13206 364.07333 0.40547 46.500
548 0.52667 0.13206 364.07333 0.40547 46.500
545 0.52667 0.13329 375.52000 0.40466 48.500
cai cong zhong 0.52667 0.13329 375.52000 0.40466 48.500
26 0.07333 0.09668 416.32667 0.44952 48.750
546 0.52000 0.13337 389.48667 0.39686 50.750
554 0.52000 0.13337 389.48667 0.39686 50.750
ID16 0.46667 0.13241 398.24667 0.37974 51.500
581 0.46667 0.13241 398.24667 0.37974 51.500
264 0.02000 0.23782 114.82000 0.39570 51.625
187 0.01333 0.22880 810.53333 0.01727 56.250
Eric Group 0.02000 0.99076 851.46000 0.01453 57.625
476 0.68667 - - - -

 

 

End of Table

Frequently Asked Questions

What is an SLQ score?

SLQ is a domain-specific performance metric devised by researchers at the Stanford Linear Accelerator Center (SLAC) for certain kinds of particle physics prediction problems. A full description, including the binning scheme, the scoring formula, and the perf command line to use, is given under "SLAC Q-Score" in the Evaluation section above.

Hope this helps!

Is there a bug in how PERF calculates APR (average precision)?

Yes! A bug was found in how perf calculates precision and average precision (APR). The bug was that perf returned APR = 0 when there was only one true positive case in the data. We fixed perf (release 5.10 now) and made a few small improvements to perf's error handling. You only need to worry about this if you are working on the protein problem, which uses average precision as one of the metrics; the problem does not affect any of the metrics used for the physics problem. Many thanks to the team who found this problem and pointed it out to us.

We have implemented an improved algorithm for average precision. The fix is the same as the experimental code we released two weeks ago, so if you are using that code everything should be fine.

Note that the way we define average precision for this competition is somewhat different from the way average precision is usually defined in information retrieval. In IR, average precision is usually the average of the precision at recalls of 0.0, 0.1, 0.2, ..., 1.0, with a special rule for interpolating precisions to these recall values. For the KDD-CUP we are using a similar, but more precise, method. Instead of simply calculating the precision at the 11 recalls mentioned above, we calculate the precision at every recall where it is defined. For each of these recall values, we find the threshold that produces the maximum precision, and take the average over all of the recall values greater than 0. This yields a lower-variance metric than the IR definition and thus should be easier to optimize. We believe this new way of calculating APR is a significant improvement over the method typically used in IR.

What format will be used for submitting predictions?

Submissions are not due until July 14, but there have been questions about the submission format. The format we will use for submissions is exactly the same as the input format used by PERF, except that the targets should be replaced with the unique example ID numbers. We will then replace the ID numbers with the true targets, and run your predictions through PERF to calculate performances. Because there are four criteria for each problem (eight criteria total if you make submissions for both problems), you will submit four sets of predictions for each problem, one for each criterion. If you do not want to optimize learning for each criterion, you can submit the same predictions for multiple criteria. For example, you could submit the same predictions for Accuracy and ROC Area, and submit different predictions for Cross-Entropy.
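Based on this description, writing one set of predictions is one line per case: example ID first, predicted value second, whitespace separated. The Python helper below is only our reading of the FAQ, not an official submission tool.

def write_submission(path, example_ids, predictions):
    # One whitespace-separated line per case: <example id> <prediction>
    with open(path, "w") as f:
        for ex_id, pred in zip(example_ids, predictions):
            f.write(f"{ex_id} {pred}\n")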

Are there missing values in the data?

The protein/bio data does not contain any missing values. The physics data does contain "missing" values. In the phy data set, columns 22,23,24 and 46,47,48 use a value of "999" to denote "not available", and columns 31 and 57 use "9999" to denote "not available". These are the column numbers in the data tables starting with 1 for the first column (the case ID numbers). If you remove the first two columns (the case ID numbers and the targets), and start numbering the columns at the first attribute, these are attributes 20,21,22, and 44,45,46, and 29 and 55, respectively. You may treat missing values any way you want, including coding them as a unique value, imputing missing values, using learning methods that can handle missing values, ignoring these attributes, etc.
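One possible way to handle this in practice (not required by the contest) is to recode the sentinel values as NaN at load time. A NumPy sketch, with 0-based column indices corresponding to the 1-based columns listed above:

import numpy as np

data = np.loadtxt("phy_train.dat")                      # col 0: EXAMPLE ID, col 1: target
for col, sentinel in [(21, 999), (22, 999), (23, 999),  # columns 22-24 in the 1-based numbering
                      (45, 999), (46, 999), (47, 999),  # columns 46-48
                      (30, 9999), (56, 9999)]:          # columns 31 and 57
    data[data[:, col] == sentinel, col] = np.nan
ids, targets, features = data[:, 0], data[:, 1], data[:, 2:]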

How do I run PERF under windows?

I'm not very familiar with Windows, but I'll do my best. One way to use perf on windows is to open a cmd window (try typing "cmd" in "Run" under the "Start" menu). In the cmd window change to the directory perf is in using "cd" commands. Then execute perf.exe, specifying the options you want on the command line, and read from a file containing the targets and predictions. Here's a simple example from my windows machine using the directories you get if you unzip the windows download for perf:

C:\>cd c:\perf.windows

C:\perf.windows>dir

Directory of C:\perf.windows

05/04/2004 01:18p <DIR> .
05/04/2004 01:18p <DIR> ..
04/28/2004 06:04p 141,956 perf.exe
04/28/2004 07:10p <DIR> perf_sample_test_data
04/29/2004 01:15p 5,966 perf.1
2 File(s) 147,922 bytes
3 Dir(s) 15,593,144,320 bytes free

C:\perf.windows>perf.exe -acc -cxe -roc -slq 0.01 < perf_sample_test_data/testperfdata
ACC 0.83292 pred_thresh 0.500000
ROC 0.88380
SLQ 0.80851 Bin_Width 0.010000
CXE 0.57335

On the protein match problem, isn't it possible to always attain perfect TOP1 by assigning every test case to class "1"? Is there a restriction to prevent this?

To calculate TOP1, we sort cases by the values you predict for them. (If you prefer to sort the cases for us, just give us "predictions" that are the ranks you assign to each case so we get your ordering when we sort -- be sure to rank the case most likely to be class 1 with the *highest* rank (a number near 1000, not near 1).) If you predict all cases to be class 1, they all sort to the same position, so the homologs do not sort by themselves to the top 1 position. We will split ties in the ordering pessimistically. That is, if three cases tie for top 1 position, and only 1 (or 2) of these 3 cases are homologs (true class 1), then the TOP1 score will be 0 because 2 (or 1) of the cases predicted to be top 1 were not homologs. In other words, we are calculating TOP1 so that ties will hurt instead of help. This is appropriate for a score like TOP1 where you really have to predict a single case that is most likely to be a match.

Are ties always broken pessimistically?

No. The performance measures that depend on ordering the data are TOP1, Average Precision, ROC Area, Rank of the Last Positive, and the Slac Q-Score. Of these, only TOP1 and Rank of Last (RKL) break ties pessimistically. RKL, which returns the rank of the positive case last predicted to be positive, pessimistically assumes that the positive cases rank to the end of any group of cases they tie with. This is appropriate for a measure like RKL that measures how far down the list the model places a positive case. But the other ordering measures: Average Precision, ROC Area, and SLQ are not pessimistic or optimistic. They break ties by assigning fractional values to all cases that are tied. For example, if 10 cases are tied, and 3 of those cases are true class 1, these measures accumulate 3/10 = 0.3 points of "positiveness" as they include each point from the group of ties. There are other ways to break ties for these measures. This approach has the advantages that it is easy to understand, is neutral to ties, and is applicable to most performance measures that integrate performance over the full range of an ordering.

When will the tasks and the data be available?

The datasets and tasks will be made available at this WWW-site in the week of April 26, 2004.

Contacts

KDD Cup 2004 Chairs
  • Rich Caruana
    Department of Computer Science, Cornell University
  • Thorsten Joachims
    Department of Computer Science, Cornell University

Please visit the original KDD Cup 2004 website for more information.

KDD Cup 2003: Network mining and usage log analysis

This Year's Challenge

This year's competition focuses on problems motivated by network mining and the analysis of usage logs. Complex networks have emerged as a central theme in data mining applications, appearing in domains that range from communication networks and the Web, to biological interaction networks, to social networks and homeland security. At the same time, the difficulty in obtaining complete and accurate representations of large networks has been an obstacle to research in this area.

This KDD Cup is based on a very large archive of research papers that provides an unusually comprehensive snapshot of a particular social network in action; in addition to the full text of research papers, it includes both explicit citation structure and (partial) data on the downloading of papers by users. It provides a framework for testing general network and usage mining techniques, which will be explored via four varied and interesting tasks. Each task is a separate competition with its own specific goals.

The first task involves predicting the future; contestants predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference. For the second task, contestants must build a citation graph of a large subset of the archive from only the LaTex sources. In the third task, each paper's popularity will be estimated based on partial download logs. And the last task is open! Given the large amount of data, contestants can devise their own questions and the most interesting result is the winner.

About the Data

The e-print arXiv, initiated in Aug 1991, has become the primary mode of research communication in multiple fields of physics, and some related disciplines. It currently contains over 225,000 full text articles and is growing at a rate of 40,000 new submissions per year. It provides nearly comprehensive coverage of large areas of physics, and serves as an on-line seminar system for those areas. It serves 10 million requests per month, including tens of thousands of search queries per day. Its collections are a unique resource for algorithmic experiments and model building. Usage data has been collected since 1991, including Web usage logs beginning in 1993. On average, the full text of each paper was downloaded over 300 times since 1996, and some were downloaded tens of thousands of times.

The Stanford Linear Accelerator Center SPIRES-HEP database has been comprehensively cataloguing the High Energy Particle Physics (HEP) literature online since 1974, and indexes more than 500,000 high-energy physics related articles including their full citation tree.

Tasks

I. Citation Prediction

Goal

The goal of this task is to predict changes in the number of citations to individual papers over time.

Input

Contestants will be given:

  1. The LaTeX source of all papers in the hep-th portion of the arXiv through March 1, 2003. For each paper, this includes the main .tex file but not separate include files or figures. It also includes the hep-th arxiv number as a unique ID.
  2. The abstracts for all of the hep-th papers in the arXiv. For each paper the abstract file contains:
    • arXiv submission date
    • revised date(s)
    • title
    • authors
    • abstract
  3. The SLAC/SPIRES dates for all hep-th papers. Some older papers were uploaded years after their initial publication, and the arXiv submission date from the abstracts may not correspond to the publication date. An alternative date has been provided from SLAC/SPIRES that may be a better estimate of the initial publication date for these old papers.
  4. The complete citation graph for the hep-th papers, obtained from SLAC/SPIRES. Each node will be labeled by its unique ID from (1). Note that revised papers may have updated citations. As such, citations may refer to future papers, i.e. a paper may cite another paper that was published after the first paper.

Output

For each paper P in the collection, contestants should report the predicted difference between

  • the number of citations P will receive from hep-th papers submitted during the period May 1, 2003 - July 31, 2003, and
  • the number of citations P will receive from hep-th papers submitted during the period February 1, 2003 - April 30, 2003. (So if there were more citations during the period May 1, 2003 - July 31, 2003, then the prediction should be a positive number.)

The format for the submission is a simple 2 column vector of [arxiv id] [difference] sorted by arxiv id.

Update May 6, 2003: This difference does not need to be an integer; floating point numbers are valid predictions.

Evaluation

The target result is a vector V with one coordinate for each paper in the initial collection (1) that receives at least 6 citations during the period February 1, 2003 - April 30, 2003. The P-th coordinate of V will consist of the true difference in number of citations for paper P.

Based on a contestant's predictions, a vector W will be constructed over the same set of papers; the P-th coordinate of W will consist of the predicted difference in number of citations for paper P.

The score of a prediction vector W will be equal to the L_1 difference between the vectors V and W.
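Concretely, the L_1 difference is just the sum of absolute coordinate-wise differences (see also the General Questions section below); the helper name in this Python sketch is ours.

def l1_difference(v, w):
    # L_1 norm of V - W: sum of absolute coordinate-wise differences
    return sum(abs(a - b) for a, b in zip(v, w))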


II. Data Cleaning

Goal

It is often estimated that data cleaning is one of the most expensive and arduous tasks of the knowledge discovery process. Whether for industry or government databases, the problem of linking records and identifying identical objects in dirty data is a hard and costly task.

The goal of this task is to clean a very large set of real-life data: We would like to re-create the citation graph of about 35,000 papers in the hep-ph portion of the arXiv.

Input

Contestants will be given the LaTeX sources of all papers in the hep-ph portion of the arXiv on April 1, 2003. For each paper, this includes the main .tex file, but not separate include files or figures. The references in each paper have been "sanitized" through a script by removing all unique identifiers such as arXiv codes or other code numbers. No attempts have been made to repair any damages from the sanitization process. Each paper has been assigned a unique number.

Output

For each paper P in the collection, a list of other papers {P1, ..., Pk} in the collection such that P cites P1, ..., Pk. Note that P might also cite papers that are not in the collection.

The format for submission is a plain ASCII file with 2 columns. The left column will be the citing-from paper id and the right column will be the cited-to paper id. The file should also be sorted.

Evaluation

The target is a graph G=(V,E) with each paper P a node in the graph, and each citation a directed edge in the graph. Assuming that a contestant submits a graph G'=(V,E'), the score is the size of the symmetric difference between E and E', i.e. |E \ E'| + |E' \ E| (true citations that were missed plus submitted citations that are not in the true graph).
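Treating each citation as a (citing id, cited id) pair, this score can be computed with plain Python sets; the helper below is only an illustration of the definition above.

def symmetric_difference_score(true_edges, submitted_edges):
    e, e_prime = set(true_edges), set(submitted_edges)
    return len(e - e_prime) + len(e_prime - e)   # missed citations + spurious citations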


III. Download Estimation

Goal

The goal of this task is to estimate the number of downloads that a paper receives in its first two months in the arXiv.

Input

Contestants will be given:

  1. all of the datasets available for Task 1: Citation Prediction.
  2. for papers published in the following months, the downloads received from the main site in each of its first 60 days in the arXiv.
    • February and March of 2000
    • February and April of 2001
    • March and April of 2002

Output

For each paper P submitted during the periods:

  • April 2000
  • March 2001
  • February 2002

contestants should report the estimated total number of downloads of P during its first 60 days in the arXiv. Note that this is a single number for each paper P, whereas the download data provided above gives a day-by-day log for the sixty days.

Evaluation

For each of the output periods (April 2000, March 2001, Feb 2002), the target result is a vector X with one coordinate for each of the top 50 papers with the greatest number of downloads in their first 60 days. For each of these papers P, the value of the P-th coordinate is the number of downloads of P during its first 60 days.

Based on a contestant's download estimations, a vector Y will be constructed, over the same set of 150 papers (50 from each period); the P-th coordinate of Y will consist of the estimated number of downloads of P during its first 60 days.

The score of a prediction vector Y will be equal to the L_1 difference between the vectors X and Y.


IV. Open Task

Goal

Contestants will be given the LaTeX sources of all papers in the hep-th portion of the arXiv on April 6, and the citation graph of the hep-th portion of the arXiv on that date.

For this "open task", the goal is to define as interesting a question as possible to ask on the data, and then to show the result of mining the data for the answer. The question addressed could be based on identifying an interesting structure, trend, or relationship in the data; posing further predictive tasks for the data; evaluating the performance of a novel algorithm on the data; or any of a number of other activities.

The results should be written up in the KDD submission format, using at most 10 pages. The write-up should cite and discuss relevant prior work. A committee of judges will select the winning entry, based on novelty, soundness of methods and evaluation, and relevance to the arXiv dataset.

Competition Rules

Agreement: Participation in the contest constitutes the participant's full and unconditional agreement and acceptance of these rules.

Eligibility: The contest is open to any party planning to attend KDD 2003. Each of the four tasks will be evaluated separately; you can enter as many tasks (or as few tasks) as you like. Only one submission per task per group is allowed. A person can participate in only one group per task.

Integrity: For Task I, only the data provided from this website can be used for the contest; use of external data for this task constitutes cheating and is prohibited.

In Task IV, use of external data is allowed.

For Tasks II and III, our initial policy was to prohibit the use of external data. The intent of this was to prevent KDD Cup participants from designing solutions in which they explicitly make use of on-line resources that might be construed as containing partial solutions to the specified tasks. However, after considerable communication with Cup participants, we feel it is necessary to resolve the policy more finely, in a way that still preserves its initial intent:

(1) The use of any bibliographic data, task-specific external data, or any other external resources specific to the task of indexing scientific literature, is prohibited.

(2) However, the use of generic lexical resources -- that is, general resources about the English language, such as WordNet, general-purpose dictionaries and thesauri, and lists of stop-words -- is permitted.

(3) If you are making use of external resources other than the examples specifically mentioned in (2), you must verify their eligibility with the KDD Cup chairs as soon as possible, and in no case later than July 1.

Winner Selection: Winners will be announced on August 15th, 2003 for each of the four different tasks. The evaluation criteria are discussed in each task description. Honorable mentions may also be awarded for noteworthy submissions.

Data

I. Citation Prediction Task

Available for contestants: 

  1. The LaTeX sources of all papers in the hep-th portion of the arXiv until May 1, 2003 are available for download. Each paper is identified by a unique arXiv id.

    There are approximately 29,000 hep-th papers with 1.7 gigs of data. The papers have been compressed to about 500M and divided into separate years for downloading.

    hep-th 1992 (22M)
    hep-th 1993 (31M)
    hep-th 1994 (39M)
    hep-th 1995 (36M)
    hep-th 1996 (41M)
    hep-th 1997 (44M)
    hep-th 1998 (45M)
    hep-th 1999 (48M)
    hep-th 2000 (53M)
    hep-th 2001 (56M)
    hep-th 2002 (59M)
    hep-th 2003 (17M)

  2. The abstracts for all the hep-th papers as a hep-th abstracts tarball.
  3. The SLAC dates for each hep-th paper as a hep-th slacdates tarball.
    • The format for the slac dates is a sorted 2 column vector where the left column is the paper's arxiv id and the right column is the SLAC date:
      [arxiv id] [date in YYYY-MM-DD format]
  4. The citation graph of the hep-th portion of the arXiv as a hep-th citations tarball.
    • The format for citations is a sorted 2 column vector where the left column is the cited from paper arxiv id and the right column is the cited to paper arxiv id:
      [paper cited from] [paper cited to]

II. Data Cleaning Task

For this task the LaTeX sources of the hep-ph papers on March 1, 2003 are available for download. A random paper id between 1 and 100,000 has been assigned to each paper. Also, a small subset of papers were converted from pdf/ps and only appear as plain text.

There are over 35,000 hep-ph papers with 1.8 gigs of data, so the download has been broken into 10 separate tar gzips of 50MB each, plus 1 extra tarball with the plain text papers.

hep-ph part 0
hep-ph part 1
hep-ph part 2
hep-ph part 3
hep-ph part 4
hep-ph part 5
hep-ph part 6
hep-ph part 7
hep-ph part 8
hep-ph part 9
hep-ph part 10 (plain text papers)

Sept 4, 2003: The corresponding citation graph for hep-ph used as the evaluation criteria is now available here.


III. Download Estimation Task

Available for this task are the same datasets for task 1 plus:

  • For each paper that was published in one of the listed six months (2/2000, 3/2000, 2/2001, 4/2001, 3/2002, 4/2002), the download logs from its first 60 days in the arXiv are provided.

Update Sept 4, 2003: Download data is no longer publicly available for download.


IV. Open Task

Contestants can use any of the hep-th data from Tasks 1 or 3.

Results

Competition overview by KDD Cup 2003 co-chairs [slides].

I. Citation Prediction Task

  • First Place: J N Manjunatha, Raghavendra Pandey, Sivaramakrishnan R., and M Narasimha Murty (1329)
  • First Runner Up: Claudia Perlich, Foster Provost, and Sofus Macskassy (1360) [slides].
  • Second Runner Up: David Vogel (1398)

The number in parentheses after each winner is the L_1 difference between the solution and the submission.

The solution for Task 1 is now available. The first column is the hep-th arxiv-id and the second column is (# of citations from May-July) - (# of citations from Feb-April) for all papers that received at least 6 citations between Feb and April.

In addition, the full list of new citations for all papers between May and July is also available.

II. Data Cleaning Task

  • First Place: David Vogel (421,582)
  • First Runner Up: Sunita Sarawagi, Kapil M. Bhudhia, Sumana Srinivasan, and V.G.Vinod Vydiswaran (516,242)
  • Second Runner Up: Martine Cadot and Joseph di Martino (538,013)

The number in parentheses after each winner is the size of the symmetric difference between the submission and the solution.

The solution for Task 2 is a citation graph provided by SLAC/SPIRES for hep-ph papers available as a zip file. Papers in the left column cite papers in the right column.

III. Download Estimation Task

  • First Place: Janez Brank and Jure Leskovec (21,232) [slides]
  • First Runner Up: Joseph Milana, Joseph Sirosh, Joel Carleton, Gabriela Surpi, Daragh Hartnett, and Michinari Momma (21,950.6)
  • Second Runner Up: Kohsuke Konishi (23,759)

The number in parentheses after each winner is the L_1 difference between the contestant's submission and the solution.

The actual download counts for the top 150 papers (50 from each of the three missing periods) are available here. The left column is the number of downloads the paper received in its first 60 days and the right column is the hep-th arxiv-id.

IV. Open Task

  • First Place: Amy McGovern, Lisa Friedland, Michael Hay, Brian Gallagher, Andrew Fast, Jennifer Neville, and David Jensen. "Exploiting Relational Structure to Understand Publication Patterns in High-Energy Physics" [slides]
  • First Runner Up: Shou-de Lin and Hans Chalupsky. "Using Unsupervised Link Discovery Methods to Find Interesting Facts and Connections in a Bibliography Dataset"
  • Second Runner Up: Shawndra Hill and Foster Provost "The Myth of the Double-Blind Review"

The submissions for Task 4 were evaluated by a small program committee consisting of the three KDD Cup 2003 co-chairs, Mark Craven (University of Wisconsin-Madison), David Page (University of Wisconsin-Madison), and Soumen Chakrabarti (Indian Institute of Technology Bombay).

General Questions

Is the use of external data allowed for Tasks II and III?

This is covered by the Integrity clause in the Competition Rules above. In short: (1) bibliographic data, task-specific external data, and any other external resources specific to the task of indexing scientific literature are prohibited; (2) generic lexical resources -- general resources about the English language such as WordNet, general-purpose dictionaries and thesauri, and lists of stop-words -- are permitted; (3) any other external resource must be cleared with the KDD Cup chairs as soon as possible, and in no case later than July 1.


Task III (Download Estimation)

Are we allowed to use all the data available in Task 1 or just the Latex sources?

All of the datasets from Task 1 are available for use for Task 3. The task description for Task 3 has been updated to clarify this.

Is it acceptable to produce a vector of floating point numbers (i.e., representing the number of downloads for each paper), or are we required to output a vector of integers?

A vector of floating point numbers is acceptable. This has also been updated in the statement of the task.

The naming convention on the files that seems to correspond with the month and year of most submissions is the "hep-th arxiv number." It has no meaning for the purpose of this problem, aside from being a unique ID. Let me know if this represents a publication date or something key like that.

The hep-th/yymmnnn number is simply a sequential accession number nnn within the yr/month yymm that it was submitted to the hep-th arXiv. It is assigned at the time of hep-th submission.

The date that appears in all the abstracts after "Date: " is the "arXiv submission date." Is this the date that the paper was submitted for publication?

No, this is the date it was deposited in the arXiv. It may have been submitted for journal publication before or after that date, though typically articles are submitted for publication shortly afterwards.

Revised dates. These only appear in some articles. I didn't see any connection to how we need to use them.

Some articles are later replaced with (a series of) revised versions, some are not. Revisions sometimes involve added references. The relevant date is usually the earliest date associated with any submission.

SLAC/SPIRES date. Our best estimate for when a paper was published. I am guessing it is included because it is the date we should use for determining the date of any given citation. The file containing these dates should list the arxiv numbers of every article we downloaded.

The SLAC/SPIRES date is sometimes a mysterious notion. Most often, it is a date shortly after the above arXiv received date, corresponding to when SLAC/SPIRES has downloaded the metadata. Sometimes it is a date long before the arXiv received date, which means that it is a pre-existing record corresponding to a back submission, e.g. a paper published in the 80's that someone has chosen to submit to hep-th during the 90s for historical or other purposes. In general, the earliest date associated to any given submission is typically the relevant one.

What is the L_1 difference between two vectors X and Y?

By the L_1 difference we mean the L_1 norm of X - Y, and hence the sum of the absolute values of the differences.


Task I (Citation Prediction)

Is it acceptable to produce a vector of floating point numbers (i.e., representing partial differences in the number of citations between time periods), or are we required to output a vector of integers?

A vector of floating point numbers is acceptable. This has also been updated in the statement of the task.

Contacts

KDD Cup 2003 Chairs

    • Johannes Gehrke
      Department of Computer Science, Cornell University
    • Paul Ginsparg
      Department of Physics and Computing and Information Science, Cornell University
    • Jon Kleinberg
      Department of Computer Science, Cornell University

Please visit the original KDD Cup 2003 website for more information.

KDD Cup 2002: BioMed document; plus gene role classification

This Year's Challenge

This year the competition included two tasks that involved data mining in molecular biology domains. The first task focused on constructing models that can assist genome annotators by automatically extracting information from scientific articles. The second task focused on learning models that characterize the behavior of individual genes in a hidden experimental setting. Both are described in more detail on the Tasks page.

Tasks

Task 1: Information Extraction from Biomedical Articles

Biomedical information exists in both the research literature and various structured databases. Some of these databases serve, in part, as a distillation of what is described in the literature. Such databases exist for genes and proteins in general, and also for more specific areas, such as the genome of a specific organism.

For this KDD Challenge Cup task, we examined the work performed by one group of curators for FlyBase (http://www.flybase.org/), a publicly available database on the genetics and molecular biology of Drosophila (fruit flies). This task focused on helping to automate the work of curating biomedical databases by identifying which papers need to be curated for Drosophila gene expression information.

The Flybase criterion for curation for gene expression is:
Does the paper contain experimental evidence for gene expression, specifically, information about the gene products (RNA transcripts or polypeptides or proteins) associated with a given gene?

For the Challenge Task, we asked the contestants to develop a system that does the following:

Given:

  • A set of papers on genetics or molecular biology
  • For each paper, a list of the genes mentioned in that paper

Determine:

  • Whether the paper meets the Flybase gene-expression curation criteria, and for each gene, indicate whether the full paper has experimental evidence for gene products (RNA and/or protein).

Thus for each paper containing experimental evidence of gene expression, we asked that a system return a check-list for each gene indicating whether it has associated RNA and/or protein products.

Task 2: Yeast Gene Regulation Prediction

There are now experimental methods that allow biologists to measure some aspect of cellular "activity" for thousands of genes or proteins at a time. A key problem that often arises in such experiments is in interpreting or annotating these thousands of measurements. This KDD Cup task focused on using data mining methods to capture the regularities of genes that are characterized by similar activity in a given high-throughput experiment.

To facilitate objective evaluation, this task did not involve experiment interpretation or annotation directly, but instead it involved devising models that, when trained to classify the measurements of some instances (i.e. genes), can accurately predict the response of held aside test instances.

The training and test data came from recent experiments with a set of S. cerevisiae (yeast) strains in which each strain is characterized by a single gene being knocked out. Each instance in the data set represents a single gene, and the target value for an instance is a discretized measurement of how active some (hidden) system in the cell is when this gene is knocked out. The goal of the task is to learn a model that can accurately predict these discretized values. Such a model would be helpful in understanding how various genes are related to the hidden system.

A subset of the genes was held aside as a test set. For the Challenge Task, we asked the contestants to develop a system that does the following:

Given:

  • A list of test set genes
  • Various data sources describing the genes of interest

Determine:

  • For each test set gene, which "activity" class the strain with the gene knocked out falls into.

The data sources for this task include nominal (categorical) features describing gene localization and function, abstracts from the scientific literature (MEDLINE), and a table of protein-protein interactions that relate the products of pairs of genes.
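
As a rough illustration of how these heterogeneous sources might be combined, the sketch below one-hot encodes nominal gene features, turns abstract text into bag-of-words features, and fits a simple classifier with scikit-learn. The toy data, feature names, class labels, and modeling choices are assumptions for illustration only, not part of the competition setup.

# A minimal sketch, assuming toy in-memory data, of one way to combine the
# nominal gene features with bag-of-words text from MEDLINE abstracts and
# train a classifier to predict the discretized "activity" class.
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from scipy.sparse import hstack

# toy data: one record per gene (nominal features, concatenated abstract text, class)
nominal = [{"localization": "nucleus", "function": "transcription"},
           {"localization": "cytoplasm", "function": "metabolism"}]
abstracts = ["transcription factor binds promoter",
             "enzyme catalyzes glycolysis reaction"]
labels = ["change", "nc"]

X_nom = DictVectorizer().fit_transform(nominal)        # one-hot nominal features
X_txt = TfidfVectorizer().fit_transform(abstracts)     # bag-of-words features
X = hstack([X_nom, X_txt])

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))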

The Hidden System

The experimental data (the target values) for this task were generated by Guang Yao and Dr. Chris Bradfield from the McArdle Laboratory for Cancer Research at the University of Wisconsin. During the KDD Cup competition, the nature of the system being measured was kept secret. Now it can be revealed...

Yao and Bradfield were measuring the activity of a yeast model of the AHR (Aryl Hydrocarbon Receptor) signaling pathway. This pathway plays a key role in how cells respond to certain toxic chemicals (among other things), and it is similar to pathways that control how cells respond to a variety of other environmental stimuli.

AHR is a protein that can act as a transcription factor. When a cell is exposed to certain toxic chemicals, such as dioxin, the AHR system acts to turn on/off the expression of certain genes. The complete inventory of gene products (proteins) that are involved in the signaling pathway is not known. The goal of Yao and Bradfield's experiment was to identify the set of genes that can play a role in the pathway.

Although the gene for AH receptor itself is not native to yeast, Yao and Bradfield transformed 4500+ strains in the yeast deletion library by inserting into each the AHR gene along with a reporter system that enabled them to measure how active AHR signaling was in any given strain.

The result of this experiment was the identification of 134 genes that, when knocked out, cause a significant change in the level of activity of the AHR signaling system.

Data Download

Task 1: Information Extraction from Biomedical Articles

Note: training and test data for this task is no longer available due to copyright concerns. See the mailing list archive for more information.

Task 2: Yeast Gene Regulation Prediction

Results

The December 2002 issue of SIGKDD Explorations will contain several articles describing the KDD Cup tasks and the solutions developed by the top-finishing teams.

Task 1: Information Extraction from Biomedical Articles

Task overview [slides].

  • First Place: ClearForest and Celera 
    Yizhar Regev and Michal Finkelstein [slides]
  • Honorable Mentions:
    • Design Technology Institute Ltd., Department of Mechanical Engineering at the National University of Singapore and Genome Institute of Singapore
      Shi Min
    • Data Mining Group, Imperial College and Inforsense Limited
      Huma Lodhi and Yong Zhang
    • Verity Inc. and Exelixis, Inc.
      Bin Chen
Task 2: Yeast Gene Regulation Prediction

Task overview [slides].

  • First Place: Adam Kowalczyk and Bhavani Raskutti 
    Telstra Research Laboratories [slides]
  • Honorable Mentions:
    • David Vogel and Randy Axelrod
      A.I. Insight Inc. and Sentara Healthcare
    • Marcus Denecke, Mark-A. Krogel, Marco Landwehr and Tobias Scheffer
      Magdeburg University
    • George Forman
      Hewlett Packard Laboratories
    • Amal Perera, Bill Jockheck, Willy Valdivia Granda, Anne Denton, Pratap Kotala and William Perrizo
      North Dakota State University

General FAQs

Can I compete on just one task in KDD Cup?

Yes, the two tasks will be treated as separate competitions.

Do I need to do anything to register for KDD Cup?

Not at this point. You should, however, sign up for the appropriate mailing lists so that you are kept informed about any issues that come up.

Task 1 (Information Extraction) FAQs

For answers to questions about Task 1 see the mailing list archive.

Task 2 (Gene Regulation) FAQs

Will KDD Cup entrants be allowed to use other resources for training their models in addition to those provided in the training and test sets?

We have attempted to provide the most relevant data that is readily accessible. We ask that entrants not use other data sources in their models.

How big is the test set?

The test set consists of 1489 instances.

Is the ratio of the different classes the same in the test set as in the training set?

Yes.

Were the training and test set split apart randomly?

Yes, using a stratified method so that the class ratios would be the same in both sets.

Do the data files distributed with the training set describe the test set instances in addition to the training set instances?

Yes. The abstracts, protein-protein interactions, localization values, function values and aliases represent knowledge about all of the genes in yeast. The test set to be provided will consist solely of a list of gene identifiers. All of the information required to instantiate features for the test set instances is in the data files that were included with the training instances.

Are the MEDLINE abstracts meant to be used as input data?

Yes, in fact it is probably necessary to use them to get competitive accuracies.

Why do the abstracts often contain references to gene names followed by a "p"? For example, abstract 10022848 references "sec4p" and "sec15p", but the file gene-abstracts.txt associates this abstract with the genes "sec4" and "sec15".

The "p" suffix is often used to refer to the protein encoded by a given gene. For example, "sec4p" denotes the protein encoded by the gene "sec4". Since the protein is the "product" of the gene, you can think of references to "sec4p" as saying something about "sec4".

Are some of the abstracts more relevant than others?

Yes, certainly. The range of relevance for the abstracts probably varies widely.

Is the relation represented by the protein-protein interaction table symmetric? Why are some pairs listed both ways, while most are not?

Yes, the relation is symmetric. Therefore the order of the pair in each row does not matter. Some pairs are listed in both orders simply because the original table had these redundancies and my code that tried to clean these up overlooked a few cases (it didn't check all aliases).

Why are some entries repeated in the protein-protein interaction table? Do the multiple entries have any significance?

This is the result of another overlooked data cleaning issue. There is no significance to the fact that some pairs have multiple entries.
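
The two data-cleaning issues above (symmetric pairs listed in both orders, and exact duplicates) can be handled with a simple canonicalization step. Below is a minimal sketch, assuming the table has already been alias-resolved into (gene, gene) pairs; the gene names are illustrative.

# A minimal sketch, assuming a list of (gene1, gene2) pairs, of how the
# symmetric interaction table could be de-duplicated: sort each pair into a
# canonical order, then keep one copy. Alias resolution (the step the
# organizers note was incomplete) would have to happen before this.
pairs = [("sec4", "sec15"), ("sec15", "sec4"), ("YNL331C", "YNL331C")]
unique_pairs = {tuple(sorted(p)) for p in pairs}
print(sorted(unique_pairs))   # [('YNL331C', 'YNL331C'), ('sec15', 'sec4')]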

Why are there some entries in the protein-protein interaction table in which a gene's product interacts with itself (i.e. the gene listed in the second column is the same as the one listed in the first)?

Certain proteins form homodimers, meaning that two copies of the same protein molecule bind to each other to form a complex. The instances of reflexive interaction (e.g. YNL331C, YNL331C) in the data set are putative homodimers.

If gene A's protein interacts with gene B's protein and gene B's protein interacts with gene C's protein, can we infer that A's protein interacts with C's protein?

No, the interaction relation is not transitive. You cannot conclude that A's protein physically interacts with C's protein. However, it may be reasonable to conclude that A and C are related in some way.

Is the list in interactions.txt exhaustive?

No, there are surely actual interactions that are not represented in the list. Moreover, there are surely some interactions in the list that are false positives.

Is it true that one protein may have multiple functions? If so, why does function.txt list only one function for each protein?

Yes, a given protein may have multiple functions. The most recent version of function.txt does list multiple functions. The original version was incomplete.

Is there clear (biological) independence between "functions" and the "protein classes"?

No, in fact there is probably a high degree of dependence. But these two features do provide somewhat different views on the functions of various genes.

Is there noise in the class labels of the instances?

Since the class labels were determined via an experimental process in a lab, there is likely to be some noise in them. However, we haven't artificially added any noise to the labels.

Has any artificial noise been added to the data?

No.

Can you give an example of what is meant by a "hidden system" and a "control system"?

One example of a system we might measure is how well a particular virus replicates when various genes have been knocked out. Other examples include the activity levels of specific metabolic or signaling pathways. The motivation for having a "control system" in these experiments is to determine whether a given knockout seems to be specific in affecting our system of interest or whether it affects the functioning of the cell broadly.

In the gene-aliases.txt file, there are some aliases that are shared by several genes. Is it correct that multiple genes can share the same alias?

Yes. In some cases (e.g. ASP3) there are multiple copies of the same gene in the genome. In other cases (e.g. YFR1) the alias is sometimes loosely used to refer to any of a family of six genes (YFR1-1, ... YFR1-6). In yet other cases (e.g. SAT2) the yeast community seems to have inadvertently used the same alias for multiple, unrelated genes. Rik Belew has provided a list of the overloaded gene aliases in the data set.

There are many more than 4507 genes referenced in the data set. Were the 4507 training and test genes selected randomly from the total complement of genes in yeast?

No, the 4507 strains that were measured in this experiment represent the strains that are viable when the gene associated with each is knocked out.

Where can I find background material on the problem domain?

Is it appropriate to think of the control and hidden systems separately, such that there are 4 possible cases:

Hidden change   Control change   Class
0               0                NC
0               1                NC
1               0                CHANGE
1               1                CONTROL

If so, would there be any chance of getting refined codings, discriminating the two NC cases?

In theory there are four separate cases, but in practice the experiments were run as follows. Some subset of genes H was identified as having their knockouts significantly up/down-regulate the hidden system. Then the control system was measured only for the genes in H. So in effect we don't have the information to distinguish between the first two lines in the table above.

Can you give more detail about how the area under an ROC curve will be calculated in the evaluation?

Here is pseudocode for this calculation.

def roc_area(predictions, total_pos, total_neg):
    """Area under the ROC curve.

    predictions: test-set class labels ("pos"/"neg"), ordered from the
                 most to the least confidently predicted positive
    total_pos:   number of positive instances in the test set
    total_neg:   number of negative instances in the test set
    """
    area = 0.0
    tp_rate = 0.0
    fp_rate = 0.0
    tp_count = 0
    fp_count = 0
    i = 0
    n = len(predictions)

    while i < n and tp_rate < 1.0:
        # remember rates from the last ROC point
        last_fp_rate = fp_rate
        last_tp_rate = tp_rate

        # consider the next instance to be another positive prediction
        if predictions[i] == "pos":
            tp_count += 1
        else:
            fp_count += 1

        # determine the coordinates of the new ROC point
        tp_rate = tp_count / total_pos
        fp_rate = fp_count / total_neg

        # update the area (trapezoid rule) whenever the curve moves right
        if fp_rate > last_fp_rate:
            area += 0.5 * (fp_rate - last_fp_rate) * (last_tp_rate + tp_rate)

        i += 1

    # account for the rest of the area after tp_rate hits 1.0
    area += (1.0 - fp_rate) * 1.0

    return area

There are some genes in the test set (as well as the training set) for which there is no data available. How are we supposed to make predictions for these cases?

These cases represent (putative) genes for which the corresponding knockouts were actually used in the experiments that measured the hidden and control systems. These cases were not removed from the data set because they reflect the nature of the real problem at hand: little or nothing is known about some genes. It is up to each competitor to decide how to make predictions for these cases. One reasonable approach is to assume that they belong to the most populous class (nc).

Contacts

KDD Cup 2002 Chairs

  • Mark Craven
    Department of Biostatistics and Medical Informatics, University of Wisconsin - Madison
  • Alexander Yeh
    The MITRE Corporation

Please visit the original KDD Cup 2002 website for more information.

KDD Cup 2001: Molecular bioactivity; plus protein locale prediction

This Year's Challenge

Because of the rapid growth of interest in mining biological databases, KDD Cup 2001 was focused on data from genomics and drug design. Sufficient (yet concise) information was provided so that detailed domain knowledge was not a requirement for entry. A total of 136 groups participated to produce a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization.

Task Description

KDD Cup 2001 involves 3 tasks, based on two data sets. The two training datasets are available from the links below, as zip files. The first dataset is a little over half a gigabyte when uncompressed and comes as a single text file, with one row per record and fields separated by commas. The second is a little over 7 megabytes uncompressed. It includes a single text file with all the data; again, the format is one row per record with comma-separated fields. But this data set is quite relational in nature, so improved accuracy may be possible by constructing more complex features or using a relational data mining technique (see the README file that comes with it). Nevertheless, we've tried to pre-compute some of the interesting relations as added fields, so that standard feature-vector algorithms can compete well. For both datasets, "names" files also are included that give the names of the fields; the names are "meaningful" only for the second dataset. For both datasets a README file is included that describes the nature of the task. The README files are repeated at the bottom of this page for those who wish to read about the data/task before choosing to download the data.

Evaluation

The following are the keys that were used for scoring. Several points are worth noting regarding Function and Localization keys. First, submissions varied widely in the use of punctuation, case, and spelling for function and localization names. Because of this variation, we decided to have our code remove punctuation and look at only a long enough prefix of a name to distinguish it from all others -- the name was then converted into a shorter standard form. These shorter forms are the ones given in the keys below. We also hand-checked entries and converted forms. Second, one gene in the test set had two localizations (contradicting our earlier statement that each gene had only one localization). For this gene, the predicted localization was counted correct if it matched *either* of the correct localizations. Third, one function appeared in a test set gene but in no training set gene. This of course made it impossible to get 100% accuracy, but everyone was subject to this same constraint, and we think it just goes with the territory of a real-world task :-)
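
To make the name-matching procedure concrete, here is a minimal sketch of the kind of normalization and prefix matching described above. This is not the organizers' actual scoring code; the helper names and the two canonical function names are illustrative assumptions.

# A minimal sketch, assuming a fixed list of canonical function names, of the
# normalization described above: strip punctuation and case, then accept a
# submitted name if it prefix-matches exactly one canonical key.
import re

def normalize(name):
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()

def match(submitted, canonical_names):
    norm = normalize(submitted)
    hits = [c for c in canonical_names
            if normalize(c).startswith(norm) or norm.startswith(normalize(c))]
    return hits[0] if len(hits) == 1 else None   # None if ambiguous or unknown

keys = ["CELL GROWTH CELL DIVISION AND DNA SYNTHESIS", "TRANSPORT FACILITATION"]
print(match('"cell growth, cell division and DNA synthesis "', keys))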

Data Download

Training data

  • Dataset 1: Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin My Note: Downloaded
  • Dataset 2: Prediction of Gene/Protein Function and Localization My Note: Downloaded

Test data

Data Descriptions

Description of Dataset 1: Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin

Drugs are typically small organic molecules that achieve their desired activity by binding to a target site on a receptor. The first step in the discovery of a new drug is usually to identify and isolate the receptor to which it should bind, followed by testing many small molecules for their ability to bind to the target site. This leaves researchers with the task of determining what separates the active (binding) compounds from the inactive (non-binding) ones. Such a determination can then be used in the design of new compounds that not only bind, but also have all the other properties required for a drug (solubility, oral absorption, lack of side effects, appropriate duration of action, toxicity, etc.).

The present training data set consists of 1909 compounds tested for their ability to bind to a target site on thrombin, a key receptor in blood clotting. The chemical structures of these compounds are not necessary for our analysis and are not included. Of these compounds, 42 are active (bind well) and the others are inactive. Each compound is described by a single feature vector comprising a class value (A for active, I for inactive) and 139,351 binary features, which describe three-dimensional properties of the molecule. The definitions of the individual bits are not included - we don't know what each individual bit means, only that they are generated in an internally consistent manner for all 1909 compounds. Biological activity in general, and receptor binding affinity in particular, correlate with various structural and physical properties of small organic molecules. The task is to determine which of these properties are critical in this case and to learn to accurately predict the class value. To simulate the real-world drug design environment, the test set contains 636 additional compounds that were in fact generated based on the assay results recorded for the training set. In evaluating the accuracy, a differential cost model will be used, so that the sum of the costs of the actives will be equal to the sum of the costs of the inactives. In other words, it is just as important to minimize your error rate on the actives as it is to minimize your error rate on the inactives, even though the training set contains more inactives than actives (and the test set might also).

We thank DuPont Pharmaceuticals for graciously providing this data set for the KDD Cup 2001 competition. All publications referring to analysis of this data set should acknowledge DuPont Pharmaceuticals Research Laboratories and KDD Cup 2001.

Description of Dataset 2: Prediction of Gene/Protein Function and Localization

The genomes of several organisms have now been completely sequenced, including the human genome -- depending on one's definition of "completely" :-). Interest within bioinformatics is therefore shifting somewhat away from sequencing, to learning about the genes encoded in the sequence. Genes code for proteins, and these proteins tend to localize in various parts of cells and interact with one another, in order to perform crucial functions. The present data set consists of a variety of details about the various genes of one particular type of organism. Gene names have been anonymized and a subset of the genes have been withheld for testing. The two tasks are to predict the functions and localizations of the proteins encoded by the genes. A gene/protein can have more than one function, but (at least in this data set) only one localization. The other information from which function and localization can be predicted includes the class of the gene/protein, the phenotype (observable characteristics) of individuals with a mutation in the gene (and hence in the protein), and the other proteins with which each protein is known to interact.

The full data set is in Full_File.data. But please notice that the task is quite "relational." For example, one might wish to learn a rule that says a gene G has function F if G interacts with another gene G1 that has function F. We have made an effort to build such features into Full_File.data. (For example, for each gene we give the number of interacting genes with a given function -- these features are probably useful for predicting at least one or two of the functions). But participants may wish to construct their own additional features or to use a relational data mining algorithm. While this certainly can be done from Full_File.data, it may be easier to do this from the relational tables that we used to build Full_File.data. These are in Genes_relation.data and Interactions_relation.data. Each of the data files has a corresponding names file as well.
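
To illustrate the kind of pre-computed relational feature mentioned above (the number of interacting genes that have a given function), here is a minimal sketch using toy in-memory tables; the table layouts and gene/function names are simplified stand-ins, not the actual Genes_relation / Interactions_relation formats.

# A minimal sketch, with toy in-memory tables, of a relational feature:
# for each gene, the number of interacting genes annotated with a function.
from collections import defaultdict

gene_functions = {"G1": {"F1"}, "G2": {"F1", "F2"}, "G3": {"F2"}}
interactions = [("G1", "G2"), ("G2", "G3")]          # symmetric pairs, listed once

neighbors = defaultdict(set)
for a, b in interactions:
    neighbors[a].add(b)
    neighbors[b].add(a)

def interacting_with_function(gene, function):
    return sum(1 for g in neighbors[gene] if function in gene_functions.get(g, set()))

print(interacting_with_function("G2", "F1"))   # G2 interacts with G1 (F1) and G3 -> 1
print(interacting_with_function("G1", "F2"))   # G1 interacts with G2 (F1, F2) -> 1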

Detailed knowledge of the biology should not be necessary for this application. This is so much the case that we almost even anonymized all the other fields as well as the gene field. But in the end we decided instead to leave the other fields alone, since this might make the data set more interesting. One word of caution: your predictor for function should not use localization, and your predictor for localization should not use function, since *both* fields will be withheld from the test genes when they are provided. Also note that, because a gene may have more than one function, we will test for correct prediction of every (gene, function) pair. By the time we provide the test data, we will provide full specification of the format for submission of your predictions.

Results

KDD Cup 2001 summary [slides].

Winners of KDD Cup 2001: Task 1 - Thrombin

  • First Place: Jie Cheng
    Canadian Imperial Bank of Commerce [slides]
  • Honorable Mention: T. Silander
    University of Helsinki

Winners of KDD Cup 2001: Task 2 - Function

  • First Place: Mark-A. Krogel
    University of Magdeburg [slides]
  • Honorable Mentions:
    C. Lambert (Golden Helix)
    J. Sese, H. Hayashi, and S. Morishita (University of Tokyo)
    D. Vogel and R. Srinivasan (A.I. Insight)
    S. Pocinki, R. Wilkinson, and P. Gaffney (Lubrizol Corp.)

Winners of KDD Cup 2001: Task 3 - Localization

  • First Place: Hisashi Hayashi, Jun Sese, and Shinichi Morishita
    University of Tokyo [slides]
  • Honorable Mentions:
    M. Schonlau (RAND)
    W. DuMouchel, C. Volinsky and C. Cortes (AT & T)
    B. Frasca, Z. Zheng, R. Parekh, and R. Kohavi (Blue Martini)

Frequently Asked Questions

This page lists questions that were asked, together with answers. Only questions of general interest appear here; numerous other questions, as well as alternative phrasings of the questions that do appear, have been omitted.

I would like to ask whether one can focus on the analysis of just one of the two datasets in order to participate in the KDD Cup.

Certainly. You may submit predictions for one, two, or all three tasks. Three winners will be identified, one for each of the three tasks: (1) prediction of active compounds for Dataset 1, (2) prediction of function for Dataset 2, (3) prediction of localization for Dataset 2.

Regarding Dataset 2, if the test data contains a table similar to Interactions_relation, will it contain data on interactions (test gene A - test gene B) in addition to interactions (test gene - training gene), or only the latter kind of interactions?

The test data will contain interactions of both types, test genes with training genes and test genes with other test genes.

Regarding Dataset 1, can we expect the ratio of the number of active compounds to the number of inactive compounds in the final test set to be roughly the same as the ratio in the given training set?

No, and to keep this similar to the real-world scenario, we don't want to say what the ratio is. But being generous people :-), we will give away 2 items of information. (1) The compounds in the test set were made after chemists saw the activity results for the training set, so as you might expect the test set has a higher fraction of actives than did the training set. (2) Nevertheless, the test set still has more inactives than actives. We realize our method of testing makes the task tougher than it would be with the standard assumption that test data are drawn according to the same distribution as the training data. But this is a common scenario for those working in the pharmaceutical industry.

How will the entries be judged? Accuracy, speed of computation, conciseness of rule?

Entirely by test set accuracy, or 1 - error. But please note that for Dataset 1, error will be the sum of error on actual actives and error on actual inactives. Thus, for example, if there are 10 actives and 100 inactives in the test set, then each active will effectively count 10 times as much as each inactive.

How do we submit our entries? Do we ftp our algorithm to you or just simply the test results? What are the constraints? Processing window? Hardware?

You will send us simply the test results in a text file. We will specify the format of this file when we provide the test data. You may arrive at your predictions for the test data in any way you like (we impose no constraints).

What is an essential gene and what is a complex?

An essential gene is one without which the organism dies. Some proteins are complexes of several peptides (each encoded by a single gene). So if several genes have the same complex, it means they code for different parts of the same protein. Your data mining system should get good use of these fields without your having to give the fields any kind of special treatment based on domain knowledge. You may just treat them as nominal (discrete-valued) fields with the possible values listed in the Genes_relation.names file. The same is true for phenotype, class, motif, etc.

For Dataset 2, what is the meaning of the Corr (real-valued) field for two interacting genes.

This is the correlation between gene expression patterns for the two genes. A correlation far from 0 implies that these genes are likely to influence one another strongly.

For Dataset 1, assume that the test set contains NA active substances and NI inactive ones. My procedure classifies correctly NAcor of NA active substances and NIcor of NI inactive substances. Is it correct that the measure of error of my procedure would be Err = (NA - NAcor)/NA + (NI - NIcor)/NI? Is it Err I should minimize?

Yes. The person/group that minimizes this on the test set wins. This is equivalent to minimizing the ordinary error with differential costs. For example, if the test set contains 10 actives and 100 inactives, we want to maximize accuracy (minimize ordinary error) where each active counts 10 times but each inactive only once. This is the standard accuracy with differential misclassification costs as is used throughout data mining and machine learning.
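
Written out in code, the quantity Err above (equivalently, error with differential misclassification costs) might look like the following minimal sketch, assuming the class labels are "A" (active) and "I" (inactive); the example labels are made up.

# A minimal sketch of the differential-cost error described above, assuming
# y_true and y_pred are lists of "A"/"I" labels for the test compounds.
def balanced_error(y_true, y_pred):
    act = [(t, p) for t, p in zip(y_true, y_pred) if t == "A"]
    ina = [(t, p) for t, p in zip(y_true, y_pred) if t == "I"]
    err_act = sum(1 for t, p in act if p != t) / len(act)
    err_ina = sum(1 for t, p in ina if p != t) / len(ina)
    return err_act + err_ina   # the quantity Err to be minimized

print(balanced_error(["A", "A", "I", "I", "I"], ["A", "I", "I", "I", "A"]))  # 0.5 + 1/3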

I was looking at the Thrombin data set, and found that 593 of the 1909 samples are all zero (i.e., none of the bits are 1). Included in this set of 593 are two Active compounds. Is this correct, or could it be a mistake?

This is correct. Clearly one cannot get 100% training set accuracy because of this, but one probably can remove the two all-zero actives from the training set without incurring any disadvantage. I would not suggest removing any of the all-zero inactives, since their contributions to the frequencies of the different attributes are important (at least for frequency-based approaches such as decision/classification trees).

A small question w.r.t. dataset 2. Some attributes have missing values (now denoted by a '?'), but there are also 'Unknown' values. What is the difference between these?

There is no difference between 'unknown' and missing values. In some cases, the experimentalist assigned the particular gene to the class 'unknown'. In other cases where the class was not known it was left blank. In the data cleansing step both should be assigned to missing values.

Is there an error concerning gene G239017? I found two localizations given for this gene.

According to the classification scheme used, a gene can have multiple localizations. Actually the localizations refer to the gene product (protein) rather than the gene itself (e.g. a cytoplasmic transcription factor that moves into the nucleus given a specific signal). Nevertheless, almost all of the proteins have only one localization, and we will ensure that each protein in the test set has only one localization.

Just for completeness: will all functions and localizations of test examples come from the set of functions and localizations of training examples? There could be at least one different case, indicated to me by the occurrence of "TRANSPOSABLE ELEMENTS VIRAL AND PLASMID PROTEINS " in the file Genes_relation.names as one possible value for the function attribute, while I did not find this value for any training example.

It is possible that some cases are not represented in the training set. This reflects the real situation where genes of a given class have not been found or confirmed yet.

I am a bit puzzled by some issues of reflexive and symmetrical interaction relations contained in Interactions_relation.data. As far as I can see, there are 44 cases of genes interacting with themselves. (Why not all?) Moreover, I found 14 gene pairs, where gene#1 interacts with gene#2, and also gene#2 with gene#1. (Again, why are these cases just sporadic?) Could you maybe give some background information about these matters?

Interactions are not necessarily reflexive. Certain protein molecules bind to form homo-dimers. All interactions are symmetrical, however: if gene1 interacts with gene2, then gene2 interacts with gene1. We have tried to list those interactions only once. However, in some cases both orderings made it into the final table. These should be considered duplicate records.

On the thrombin data set, is there any particular order in the way 139351 binary features were generated?

No.

For function prediction, the README file gives as an example of a function '"Auxotrophies, carbon and"'. But this is not a function, it is a phenotype. What's up with this?

This was my (David's) mistake. I wanted to give an example that involved the double-quotes, but I mistakenly copied out the wrong string. I should have used as the example '"CELL GROWTH CELL DIVISION AND DNA SYNTHESIS "'. So don't let this confuse you. But don't worry if you omitted the double-quotes or comma (see next answer).

The double-quotes in some of the function names look odd. Are we supposed to include them? And what about the blanks at the end of some of the strings, such as in '"CELL GROWTH CELL DIVISION AND DNA SYNTHESIS "'? And what about commas? In the Genes_relation.data file, there was a comma in this function after CELL GROWTH. What should we do?

Use the function names as they appear in the .data file, e.g., use '"CELL GROWTH CELL DIVISION AND DNA SYNTHESIS "'. But actually we have written our scoring code so it can handle the case where you omit the double quotes, and so that it only looks at enough of the string to distinguish it from the other functions, so that no one should be penalized for differences in punctuation or decisions about trailing blanks, etc.

Regarding Dataset 2, you suggested using features which depend on functions/localizations of the genes that gene G interacts with. While I certainly can do this for the training set, it will be impossible to do this for test set, since I will not know the function/localization... Is there something I do not understand?

For any gene G in the test set, you will know function and localization for the training set genes with which G interacts. So if a test set gene G interacts with a training set gene G1 that has function F, then you might infer the test set gene G has function F. You can also "pull yourself up by your bootstraps" as follows. If G interacts with another test set gene G2, and you have a high-confidence prediction that G2 has function F, you might infer that G has function F.

Will you weight the accuracy score for Task 2 and Task 3 in the same way you do for Task 1?

No. Let's go through these in detail, starting with Task 3 because it is easier. Every gene (actually, protein) in the test set has exactly one localization. For each gene, your prediction is either correct or incorrect. Accuracy is simply the fraction of localizations that are correctly predicted. (A non-prediction for a gene counts as an incorrect prediction.) The highest accuracy wins. Now let's go to Task 2. Because most proteins have multiple functions, we consider how many of the possible (protein,function) pairs are correctly predicted. If you include a (protein,function) pair that is known to be correct, i.e., that appears in our key, this is a true positive. If you include a pair that is not in our key, this is a false positive. If you fail to include a pair that appears in our key, this is a false negative. And you get credit for each pair that you do not predict that also does not appear in our key -- this is a true negative. The accuracy of your predictions is just the standard (true positive + true negative)/(true positive + true negative + false positive + false negative). It is worth noting that we very seriously considered using a second scoring function, a weighted accuracy, for this task. But we decided not to use a second function because we saw no compelling reason to assume that errors of omission are any more or less costly than errors of commission for this task.
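
As a concrete illustration of the Task 2 scoring described above, here is a minimal sketch that counts true/false positives and negatives over (protein, function) pairs. The protein and function names are made up, and 'universe' here simply means every candidate pair (each test protein crossed with each known function).

# A minimal sketch, with made-up gene/function names, of the (protein, function)
# pair accuracy described above.
from itertools import product

proteins = ["G1", "G2"]
functions = ["F1", "F2", "F3"]
universe = set(product(proteins, functions))

key       = {("G1", "F1"), ("G1", "F2"), ("G2", "F3")}   # true (protein, function) pairs
predicted = {("G1", "F1"), ("G2", "F2"), ("G2", "F3")}   # submitted pairs

tp = len(predicted & key)
fp = len(predicted - key)
fn = len(key - predicted)
tn = len(universe - predicted - key)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, fp, fn, tn, accuracy)   # 2 1 1 2 -> accuracy 4/6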

Regarding Task 3, if my model fails to predict localization for some gene, how should I specify this?

Just don't include any entry for that gene in your results file. But because of the scoring function, you might as well just guess a localization for that gene (e.g., the localization that appears most often in the training set).

Regarding the test set for Dataset 2, if we use the composite variables with the number of interactions... On the training set, I assume that these variables only take a count of the number of interactions with training genes. What about on the test set? It seems they should count only the interactions with the training set genes, in order for the numbers to be comparable. If this is not the case, could you please create a version of the test set in which this is the case.

These variables (even when appearing in the test set) do indeed count only interactions with training set genes, in order to maintain consistency.

Regarding Task 1, in the first question period, you made reference to the fact that there would be more inactives than actives in the test set. I just want to make sure you're sticking by that statement, and I can count on that fact.

We're sticking by this statement -- we're using exactly the test set that we had in mind all along. There are more inactives than actives. But because the test set molecules were synthesized after the chemists looked at the activity levels of the training set molecules, you can expect there's a higher fraction of actives in the test set than in the training set. As we mentioned before, this makes matters tougher than under the fairly standard assumption that the test set is drawn according to the same distribution as the training set (or that it's a held-out set drawn randomly, uniformly, without replacement). But we're using it because, as stated before, it models the real world setting where this type of task arises. Can the data mining systems do better than the chemists alone (can they make a contribution that will be useful to the chemists)? If your predictions are strong on this test set, it indicates that your model would have been useful to the chemists in choosing the next round of compounds to make.

Contacts

KDD Cup 2001 Chairs

 

  • Christos Hatzis
    SilicoInsights
  • David Page
    University of Wisconsin

Please visit the original KDD Cup 2001 website for more information.

KDD Cup 2000: Online retailer website clickstream analysis

This Year's Challenge

The KDD Cup 2000 domain contains clickstream and purchase data from Gazelle.com, a legwear and legcare web retailer that closed their online store on 8/18/2000.

Tasks

Task Description

Question 1

Given a set of page views, will the visitor view another page on the site or will the visitor leave?

Question 2

Given a set of page views, which product brand will the visitor view in the remainder of the session?

Question 3

Given a set of purchases over a period of time, characterize visitors who spend more than $12 (order amount) on an average order at the site.

Question 4

Given a set of page views, characterize killer pages, i.e., pages after which users leave the site.

Question 5

Given a set of page views, characterize which product brand a visitor will view in the remainder of the session.

Data Download

The zip file contains the KDD Cup data with three real-world datasets, and three additional datasets for an association task based on the same data. Use of the data requires acknowledgement to Blue Martini Software (e.g., "We wish to thank Blue Martini Software for contributing the KDD Cup 2000 data.") and a reference to one of the following articles:
  1. For the KDD-CUP 2000: Ron Kohavi, Carla E. Brodley, Brian Frasca, Llew Mason, and Zijian Zheng. 2000. KDD-Cup 2000 organizers' report: peeling the onion. SIGKDD Explorations 2(2):86-98, December 2000. doi:10.1145/380995.381033
  2. For the association task: Zijian Zheng, Ron Kohavi, and Llew Mason. 2002. Real World Performance of Association Rule Algorithms. In ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '01). ACM, New York, NY, USA, 401-406. doi:10.1145/502512.502572

data file


data usage restrictions

Results

Summary talk [slides].

The reference for KDD Cup 2000 is as follows:

Ron Kohavi, Carla Brodley, Brian Frasca, Llew Mason, and Zijian Zheng. KDD-Cup 2000 organizers' report: Peeling the onion. SIGKDD Explorations, 2(2):86-98, 2000. [pdf]

Question 1 of KDD Cup 2000

  • First Place: Amdocs [paper, poster]
  • Honorable Mentions:
    Mui Seng Martin Lee, Chong Jin Ong and S. Sathiya Keerthi of Mechanical and Production Engineering Department, National University of Singapore

Question 2 of KDD Cup 2000

  • First Place: Salford Systems, Inc
  • Honorable Mentions:
    MP13 team of Alexei Vopilov, Ivan Shabalin and Vladimir Mikheyev, and the team of Mukund Deshpande, George Karypis, Department of Computer Science and Engineering, University of Minnesota

Question 3 of KDD Cup 2000

  • First Place: Salford Systems, Inc
  • Honorable Mentions:
    Orit Rafaely, Tel-Aviv University and Amdocs

Question 4 of KDD Cup 2000

  • First Place: e-steam
  • Honorable Mentions:
    SAS, Amdocs, and LLSoft, Ltd

Question 5 of KDD Cup 2000

Contacts

KDD Cup 2000 Chairs
  • Carla Brodley
    School of Electrical and Computer Engineering, Purdue University
  • Ronny Kohavi
    http://www.kohavi.com   ronnyk a@t live dot com

Special thanks to Brian Frasca, Llew Mason, and Zijian Zheng from Blue Martini Software, and to Ben Bernstein from Gazelle.com. Thanks to Acxiom for providing data enhancements.

KDD Cup 1999: Computer network intrusion detection

This Year's Challenge

The task for the classifier learning contest organized in conjunction with the KDD'99 conference was to learn a predictive model (i.e. a classifier) capable of distinguishing between legitimate and illegitimate connections in a computer network.

Task Description

Intrusion Detector Learning

Software to detect network intrusions protects a computer network from unauthorized users, including perhaps insiders. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.

The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection contest uses a version of this dataset.

Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.

The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records.

A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol. Each connection is labeled as either normal, or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.

Attacks fall into four main categories:

  • DOS: denial-of-service, e.g. syn flood;
  • R2L: unauthorized access from a remote machine, e.g. guessing password;
  • U2R: unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks;
  • probing: surveillance and other probing, e.g., port scanning.

It is important to note that the test data is not from the same probability distribution as the training data, and it includes specific attack types not in the training data. This makes the task more realistic. Some intrusion experts believe that most novel attacks are variants of known attacks and the "signature" of known attacks can be sufficient to catch novel variants. The datasets contain a total of 24 training attack types, with an additional 14 types in the test data only.

Derived Features

Stolfo et al. defined higher-level features that help in distinguishing normal connections from attacks. There are several categories of derived features.

The "same host" features examine only the connections in the past two seconds that have the same destination host as the current connection, and calculate statistics related to protocol behavior, service, etc.

The similar "same service" features examine only the connections in the past two seconds that have the same service as the current connection.

"Same host" and "same service" features are together called time-based traffic features of the connection records.

Some probing attacks scan the hosts (or ports) using a much larger time interval than two seconds, for example once per minute. Therefore, connection records were also sorted by destination host, and features were constructed using a window of 100 connections to the same host instead of a time window. This yields a set of so-called host-based traffic features.
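
As a rough illustration (not the contest's actual feature-extraction code), the sketch below derives two of the "same host" time-based features, count and serror_rate, from a toy table of connection records using a two-second window; the column names and flag values are assumptions for illustration.

# A minimal sketch, assuming a toy connections table, of how the two-second
# "same host" traffic features (count and serror_rate) could be derived.
import pandas as pd

conns = pd.DataFrame({
    "time":     [0.0, 0.5, 1.0, 3.5, 3.6],
    "dst_host": ["h1", "h1", "h2", "h1", "h1"],
    "flag":     ["SF", "S0", "SF", "S0", "SF"],   # "S0" stands in for a SYN error
})

def same_host_features(df, window=2.0):
    counts, serror_rates = [], []
    for _, row in df.iterrows():
        # connections to the same destination host within the past `window` seconds
        recent = df[(df["dst_host"] == row["dst_host"]) &
                    (df["time"] <= row["time"]) &
                    (df["time"] > row["time"] - window)]
        counts.append(len(recent))
        serror_rates.append((recent["flag"] == "S0").mean())
    return df.assign(count=counts, serror_rate=serror_rates)

print(same_host_features(conns))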

Unlike most of the DOS and probing attacks, there appear to be no sequential patterns that are frequent in records of R2L and U2R attacks. This is because the DOS and probing attacks involve many connections to some host(s) in a very short period of time, but the R2L and U2R attacks are embedded in the data portions of packets, and normally involve only a single connection.

Useful algorithms for mining the unstructured data portions of packets automatically are an open research question. Stolfo et al. used domain knowledge to add features that look for suspicious behavior in the data portions, such as the number of failed login attempts. These features are called "content" features.

A complete listing of the set of features defined for the connection records is given in the three tables below. The data schema of the contest dataset is available in machine-readable form.

Table 1: Basic features of individual TCP connections

 
feature name description type
duration length (number of seconds) of the connection continuous
protocol_type type of the protocol, e.g. tcp, udp, etc. discrete
service network service on the destination, e.g., http, telnet, etc. discrete
src_bytes number of data bytes from source to destination continuous
dst_bytes number of data bytes from destination to source continuous
flag normal or error status of the connection discrete
land 1 if connection is from/to the same host/port; 0 otherwise discrete
wrong_fragment number of "wrong" fragments continuous
urgent number of urgent packets continuous

End of Table

Table 2: Content features within a connection suggested by domain knowledge

 
feature name description type
hot number of "hot" indicators continuous
num_failed_logins number of failed login attempts continuous
logged_in 1 if successfully logged in; 0 otherwise discrete
num_compromised number of "compromised" conditions continuous
root_shell 1 if root shell is obtained; 0 otherwise discrete
su_attempted 1 if "su root" command attempted; 0 otherwise discrete
num_root number of "root" accesses continuous
num_file_creations number of file creation operations continuous
num_shells number of shell prompts continuous
num_access_files number of operations on access control files continuous
num_outbound_cmds number of outbound commands in an ftp session continuous
is_hot_login 1 if the login belongs to the "hot" list; 0 otherwise discrete
is_guest_login 1 if the login is a "guest" login; 0 otherwise discrete

End of Table

Table 3: Traffic features computed using a two-second time window
feature name description type
count number of connections to the same host as the current connection in the past two seconds continuous
Note: The following features refer to these same-host connections.
serror_rate % of connections that have "SYN" errors continuous
rerror_rate % of connections that have "REJ" errors continuous
same_srv_rate % of connections to the same service continuous
diff_srv_rate % of connections to different services continuous
srv_count number of connections to the same service as the current connection in the past two seconds continuous
Note: The following features refer to these same-service connections.
srv_serror_rate % of connections that have "SYN" errors continuous
srv_rerror_rate % of connections that have "REJ" errors continuous
srv_diff_host_rate % of connections to different hosts continuous

End of Table

Evaluation

Each entry was scored against the corrected test data by a scoring awk script, using the published cost matrix and the true labels of the test examples.
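
As an illustration of cost-matrix scoring in general (the published KDD'99 cost matrix itself is not reproduced here), the sketch below averages per-connection misclassification costs using a placeholder matrix; the category order and cost values are assumptions for illustration only.

# A minimal sketch of cost-matrix scoring. The values below are placeholders,
# not the published contest matrix. The score for an entry is the average
# per-connection cost of its predictions.
import numpy as np

categories = ["normal", "probe", "dos", "u2r", "r2l"]
# placeholder cost[true][predicted]: 0 on the diagonal, 1 off the diagonal
cost = np.ones((5, 5)) - np.eye(5)

y_true = np.array([0, 2, 2, 4, 1])   # indices into `categories`
y_pred = np.array([0, 2, 0, 4, 3])

avg_cost = cost[y_true, y_pred].mean()
print(avg_cost)   # 2 mistakes out of 5 at unit cost -> 0.4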

Data Download

This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

Note: the training and test datasets are also available in the UC Irvine KDD archive.

Results

Winners of KDD Cup 1999

  • First Place: Bernhard Pfahringer
    Austrian Research Institute for Artificial Intelligence
  • First Runner Up: Itzhak Levin
    LLSoft, Inc. (using Kernel Miner)
  • Second Runner Up: Vladimir Miheev, Alexei Vopilov, and Ivan Shabalin
    MP13 company, Moscow, Russia [details]

Contacts

KDD Cup 1999 Chair

  • Charles Elkan
    University of California, San Diego

Please read this page for more details about KDD Cup 1999 results.

KDD Cup 1998: Direct marketing for profit optimization

This Year's Challenge

The competition task is a regression problem where the goal is to estimate the return from a direct mailing in order to maximize donation profits.

Task Description

The data set for this year's Cup has been generously provided by the Paralyzed Veterans of America (PVA). PVA is a not-for-profit organization that provides programs and services for US veterans with spinal cord injuries or disease. With an in-house database of over 13 million donors, PVA is also one of the largest direct mail fund raisers in the country.

Participants in the CUP will demonstrate the performance of their tool by analyzing the results of one of PVA's recent fund raising appeals. This mailing was dropped in June 1997 to a total of 3.5 million PVA donors. It included a gift "premium" of personalized name & address labels plus an assortment of 10 note cards and envelopes. All of the donors who received this mailing were acquired by PVA through premium-oriented appeals like this. The analysis data set will include:

  • A subset of the 3.5 million donors sent this appeal
  • A flag to indicate respondents to the appeal and the dollar amount of their donation
  • PVA promotion and giving history
  • Overlay demographics, including a mix of household and area level data.

Unlike last year, all available information about the fields will be made available in the project documentation.

The objective of the analysis will be to identify response to this mailing - a classification or discrimination problem.

Evaluation

The CUP is aimed at recognizing the most accurate, innovative, efficient and methodologically advanced data mining tools in the marketplace.

The participants will again be evaluated based on the performance of their algorithm on the validation or hold-out data set. The KDD-CUP program committee will consider the following metrics in their evaluations:

  • Lift curve or gains table analysis listing the cumulative percent of targets recovered in the top quantiles of the file
  • Receiver operating characteristics (ROC) curve analysis and the area under the ROC curve
  • Several statistical tests to ensure the robustness of the results.

Last year, the performance in the top 10 percent of the file was considered as a measure of precision while the performance in the top 40 percent of the file was considered as a measure of stability and marketing coverage. The average performance up to the 40th percentile was also looked at as a measure of overall performance.
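
For example, the top-percentile measures described above can be read off a simple gains calculation like the minimal sketch below, which uses made-up model scores and response flags.

# A minimal sketch, with made-up scores and response labels, of a gains-table
# style evaluation: cumulative percent of responders captured in the top
# quantiles of the file when sorted by model score.
import numpy as np

scores    = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
responded = np.array([1,   1,   0,   1,   0,   0,   0,   1,   0,   0])

order = np.argsort(-scores)                 # best-scored prospects first
cum_targets = np.cumsum(responded[order]) / responded.sum()
for decile in (0.1, 0.2, 0.3, 0.4):
    k = int(len(scores) * decile)
    print(f"top {decile:.0%}: {cum_targets[k - 1]:.0%} of responders")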

Data Download

Usage Notes

The KDD-CUP-98 data set and the accompanying documentation are now available for general use with the following restrictions:

  • If you intend to use this data set for training or educational purposes, you must not reveal the name of the sponsor PVA (Paralyzed Veterans of America) to the trainees or students. You are allowed to say "a national veterans organization"...

Information Files

  • readme. This list, listing the files in the FTP server and their contents.
  • instruct.txt . General instructions for the competition.
  • cup98doc.txt. This file, an overview and pointer to more detailed information about the competition.
  • cup98dic.txt. Data dictionary to accompany the analysis data set.
  • cup98que.txt. KDD-CUP questionnaire. PARTICIPANTS ARE REQUIRED TO FILL-OUT THE QUESTIONNAIRE and turn in with the results.
  • valtargt.readme. Describes the valtargt.txt file.

Data Files

  • cup98lrn.zip PKZIP compressed raw LEARNING data set. (36.5M; 117.2M uncompressed)
  • cup98val.zip PKZIP compressed raw VALIDATION data set. (36.8M; 117.9M uncompressed)
  • cup98lrn.txt.Z UNIX COMPRESSed raw LEARNING data set. (36.6M; 117.2M uncompressed)
  • cup98val.txt.Z UNIX COMPRESSed raw VALIDATION data set. (36.9M; 117.9M uncompressed)
  • valtargt.txt. This file contains the target fields that were left out of the validation data set that was sent to the KDD CUP 98 participants. (1.1M)

Note: the datasets are also available in the UC Irvine KDD archive.

Results

Winners of KDD Cup 1998

  • First Place:
    Urban Science Applications, Inc. with their software GainSmarts
  • First Runner Up:
    SAS Institute, Inc. with their software Enterprise Miner
  • Second Runner Up:
    Quadstone Limited with their software Decisionhouse

Contacts

KDD Cup 1998 Program Committee

  • Vasant Dhar
    New York University, New York, NY
  • Tom Fawcett
    Bell Atlantic, New York, NY
  • Georges Grinstein
    University of Massachusetts, Lowell, MA
  • Ismail Parsa
    Epsilon, Burlington, MA
  • Gregory Piatetsky-Shapiro
    Knowledge Stream Partners, Boston, MA
  • Foster Provost
    New York University, New York, NY
  • Kyusoek Shim
    Bell Laboratories, Murray Hill, NJ

KDD Cup 1997: Direct marketing for lift curve optimization

This Year's Challenge

This year, for the first time, the KDD 1997 Organization is organizing a Knowledge Discovery and Data Mining competition (KDD CUP 1997) in conjunction with the 3rd International Conference on Knowledge Discovery and Data Mining (KDD 1997).

The Cup is open to all KDDM tool vendors, academics and corporations with significant applications. All products, applications, research prototypes and black-box solutions are welcome. If requested, the anonymity of the participants and their affiliated companies / institutions will be preserved. Our aim is not to rank the participants but to recognize the most innovative, efficient and methodologically advanced KDDM tools.

This year's challenge is to predict who is most likely to donate to a charity. Contestants were evaluated on their accuracy on the validation data set.

Note: the data used in KDD Cup 1997 is exactly the same as KDD Cup 1998. Please read this page for more details. My Note: See below

 

Read More

  • Results My Note: This link no longer works.
  • Contacts My Note: This link no longer works.

KDD Cup 1998 Data

Source: http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html

Abstract

This is the data set used for The Second International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-98 The Fourth International Conference on Knowledge Discovery and Data Mining. The competition task is a regression problem where the goal is to estimate the return from a direct mailing in order to maximize donation profits.

Usage Notes

The KDD-CUP-98 data set and the accompanying documentation are now available for general use with the following restrictions:

  1. The users of the data must notify Ismail Parsa (iparsa@epsilon.com) and Ken Howes (khowes@epsilon.com) in the event they produce results, visuals or tables, etc. from the data and send a note that includes a summary of the final result.
  2. The authors of published and/or unpublished articles that use the KDD-Cup-98 data set must also notify the individuals listed above and send a copy of their published and/or unpublished work.
  3. If you intend to use this data set for training or educational purposes, you must not reveal the name of the sponsor PVA (Paralyzed Veterans of America) to the trainees or students. You are allowed to say "a national veterans organization"...

For more information regarding the KDD-Cup (including the list of the participants and the results), please visit the KDD-Cup-98 web page at: http://www.epsilon.com/new. While there, scroll down to Data Mining Presentations where you will find the KDD-Cup-98 web page. My Note: Page Not Found

Ismail Parsa

Epsilon 
50 Cambridge Street
Burlington MA 01803 USA
TEL: (781) 685-6734 
FAX: (781) 685-0806 

Information files:

  • readme. This list, listing the files in the FTP server and their contents.
  • instruct.txt . General instructions for the competition.
  • cup98doc.txt. This file, an overview and pointer to more detailed information about the competition.
  • cup98dic.txt. Data dictionary to accompany the analysis data set. My Note: Downloaded
  • cup98que.txt. KDD-CUP questionnaire. PARTICIPANTS ARE REQUIRED TO FILL-OUT THE QUESTIONNAIRE and turn in with the results.
  • valtargt.readme. Describes the valtargt.txt file.

Data files:

  • cup98lrn.zip PKZIP compressed raw LEARNING data set. (36.5M; 117.2M uncompressed) My Note: Downloaded
  • cup98val.zip PKZIP compressed raw VALIDATION data set. (36.8M; 117.9M uncompressed) My Note: Downloaded
  • cup98lrn.txt.Z UNIX COMPRESSed raw LEARNING data set. (36.6M; 117.2M uncompressed)
  • cup98val.txt.Z UNIX COMPRESSed raw VALIDATION data set. (36.9M; 117.9M uncompressed)
  • valtargt.txt. This file contains the target fields that were left out of the validation data set that was sent to the KDD CUP 98 participants. (1.1M) My Note: Downloaded

 


The UCI KDD Archive
Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425

Last modified: 16 Feb 1999
