Data Science for Business Data Science

Story

The Data Science for Business book has a section, Example: Attribute Selection with Information Gain, in Chapter 3 (Introduction to Predictive Modeling: From Correlation to Supervised Segmentation). It illustrates the use of information gain with a simple but realistic dataset taken from the machine learning dataset repository at the University of California at Irvine. The dataset describes edible and poisonous mushrooms and is taken from The Audubon Society Field Guide to North American Mushrooms.

The Mushroom Data Set, its Data Set Description, and Data Folder are captured below.

The steps for getting the dataset into Spotfire 6.0 are as follows:

Copied the data set in the Web page to a text file

Copied and pasted the text file into Spotfire 6.0

Noted that the column definitions were missing, so added them

Added back the first data row, which had been used up to create the column definitions

Visualized the data set in the bar graph to verify the statistics on Class Distribution: edible: 4208 (51.8%) and poisonous: 3916 (48.2%), for a total of 8124 instances.
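The import and verification steps above can also be sketched outside Spotfire. This is a minimal sketch, assuming the data was saved as comma-separated text with no header row and that the class label is the first field (`e` for edible, `p` for poisonous). The function names `read_with_header` and `class_distribution` and the three-row sample are hypothetical and only illustrate the idea.

```python
import csv
import io
from collections import Counter


def read_with_header(text, column_names):
    """Parse headerless comma-separated text into dicts keyed by column_names."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(column_names, row)) for row in reader if row]


def class_distribution(rows, class_column="class"):
    """Return {label: (count, percent)} for the class column."""
    counts = Counter(row[class_column] for row in rows)
    total = sum(counts.values())
    return {label: (n, round(100.0 * n / total, 1)) for label, n in counts.items()}


# Hypothetical three-row sample, truncated to three columns for illustration only.
sample = "p,x,s\ne,x,s\ne,b,s\n"
rows = read_with_header(sample, ["class", "cap-shape", "cap-surface"])
print(class_distribution(rows))

# The published counts reproduce the percentages quoted above.
print(round(100 * 4208 / 8124, 1), round(100 * 3916 / 8124, 1))
```

Supplying the column names separately, rather than letting the first line act as the header, means no data row is consumed, which avoids the need to add the first row back.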

My Note: This and the next tab show why it is impossible to distinguish between poisonous and non-poisonous mushrooms. Then I wanted to re-create the following:

Similarly: Iris Data Set, Data Set Description, and Data Folder, and the Scotch Whiskies Data

The purpose is to follow The Data Mining Process:

  • Business Understanding:
    • Use real Subject Matter Expertise content instead of general Web content.
  • Data Understanding:
    • Make all content data so unstructured, semi-structured, and structured information are integrated data.
  • Data Preparation:
    • Create an index of content topics and objects that is both a relational and graph database.
  • Modeling:
    • A searchable Information Model with Analytics (Ontology) linked to the Thesaurus (Taxonomy) linked to the Glossary (Vocabulary).
  • Evaluation:
    • Finding more needles in the needle haystack and discovering things of interest that you did not know how to look for.
  • Deployment:
    • Publicly available on the Web using the Google Chrome Browser.

So I have shown how one can go from a Book, to an acronym (TFIDF) in the Book that one wants to understand, to an Example: Attribute Selection with Information Gain that references the UC Irvine Machine Learning Repository page for the data set used to illustrate information gain.
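The information gain measure referenced above can be sketched in a few lines. This is a minimal sketch, assuming the usual definition (parent entropy minus the size-weighted entropy of the child segments). The function names, the `odor` attribute, and the toy rows are hypothetical and are not taken from the book or the data set.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Parent entropy minus the size-weighted entropy of the children
    produced by splitting rows on attribute."""
    parent = entropy([r[target] for r in rows])
    children = {}
    for r in rows:
        children.setdefault(r[attribute], []).append(r[target])
    weighted = sum(len(c) / len(rows) * entropy(c) for c in children.values())
    return parent - weighted


# Hypothetical toy rows, not taken from the Mushroom Data Set.
toy = [
    {"odor": "none", "class": "edible"},
    {"odor": "none", "class": "edible"},
    {"odor": "foul", "class": "poisonous"},
    {"odor": "foul", "class": "poisonous"},
]
print(information_gain(toy, "odor", "class"))  # 1.0: the split is perfect

# Entropy of the class distribution reported by UCI (4208 edible, 3916 poisonous).
print(round(entropy(["e"] * 4208 + ["p"] * 3916), 3))
```

Because the two classes are almost evenly balanced, the parent entropy is close to its 1-bit maximum, so an attribute that separates the classes well can show a large gain.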

MORE TO FOLLOW

Slides


Data Science for Business: Semantic Verses

http://semanticommunity.info 
http://www.meetup.com/Federal-Big-Data-Working-Group/  
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup

BrandNiemann02142014Slide1.PNG

Data Science for Business

BrandNiemann02142014Slide2.PNG

Magnet Text Analysis Engine: Understands What the Text is About

http://semanticverses.com/Default.aspx

BrandNiemann02142014Slide3.PNG


Data Science for Business Knowledge Base

http://semanticommunity.info/Data_Science/Data_Science_for_Business

BrandNiemann02142014Slide4.PNG 

MindTouch

BrandNiemann02142014Slide5.PNG

Specific Example

BrandNiemann02142014Slide6.PNG

Using Google Find for TFIDF 1

BrandNiemann02142014Slide7.PNG

Using Google Find for TFIDF 10

BrandNiemann02142014Slide8.PNG

The Data Mining Process 1

BrandNiemann02142014Slide9.PNG

The Data Mining Process 2

BrandNiemann02142014Slide10.PNG

Data Preparation

BrandNiemann02142014Slide11.PNG

Modeling

BrandNiemann02142014Slide12.PNG

Evaluation

http://semanticommunity.info/Data_Science/TIBCO_Spotfire_6_for_Data_Science#Find

BrandNiemann02142014Slide13.PNG

Deployment

 Web Player

BrandNiemann02142014Slide14.PNG

Spotfire Dashboard

For Internet Explorer users and those wanting full-screen display, use: Web Player. Get Spotfire for iPad App.


Research Notes

MindTouch

Control Panel

http://semanticommunity.info/deki/cp/

From Our Blog

http://www.mindtouch.com/blog/2014/0...indtouch-gong/

Inside MindTouch: Gong    

Jason Lazarski is gonging the expansion of MindTouch at Intuit and was also responsible for recently closing Teradata as a new customer.

http://www.mindtouch.com/blog/2013/1...thub_workflow/

Information and Resources

Stay tuned with our progress

MindTouch News Blog

http://www.mindtouch.com/blog/

MindTouch TCS Status Blog

Bookmark this site for regular information on MindTouch uptime and performance.
http://blog.mindtouch.us/

Developer guides

Developer guide

This guide provides resources for accessing the API, configuring components of MindTouch, customizing the CSS, using DekiScript, and accessing other advanced configurations.
http://help.mindtouch.us/01MindTouch...eveloper_Guide

DekiScript guide

MindTouch has a built-in programming language called DekiScript that allows you to access system information and add business logic to your MindTouch articles. DekiScript is similar to JavaScript in some of its syntax and is executed within the page itself.

http://help.mindtouch.us/01MindTouch...ide/DekiScript

API documentation

http://help.mindtouch.us/01MindTouch.../API_Reference

This guide contains the MindTouch API documentation. Find the most popular, highest rated, and recently updated articles in the View Featured tab below. Browse articles organized by methods in the View by Tags tab, or browse them by title in the View All tab. You can also use the search bar to find articles by keywords.

Please review our API Frequently Asked Questions to get started and to address common questions. 

Skinning Guide

http://help.mindtouch.us/01MindTouch..._Look_and_Feel

This guide provides an overview of product features and related technologies. In addition, it contains recommendations on best practices, tutorials for getting started, and troubleshooting information for common situations.

Read the guides

Admin Guide

http://help.mindtouch.us/01MindTouch...istrator_Guide

This guide provides an overview of product features and related technologies for administering MindTouch.

Pro Member Guide

http://help.mindtouch.us/01MindTouch_4/User_Manual

This guide provides an overview of product features and related technologies. In addition, it contains recommendations on best practices, tutorials for getting started, and troubleshooting information for common situations.

IDF guide

http://help.mindtouch.us/01MindTouch...d_relationship

This guide reviews how IDF templates are structured to encourage content creation and relationships.

IDF comes with 5 templates that are structured so you can easily build out documentation. The 5 templates are as follows: User Guide, Feature Page, Troubleshooting Page, Reference Page, and Tutorial Page.

MindTouch.Us Knowledge Base

http://help.mindtouch.us/01MindTouch_TCS/kb

This guide contains FAQs, solutions, and other information about the MindTouch TCS product. Find the most popular, highest rated, and recently updated articles in the View Featured tab below. Browse articles by tags in the View by Tags tab, or browse by title in the View All tab. You can also use the search bar to find articles by keywords.

MindTouch University

MindTouch Training
http://help.mindtouch.us/Support/Training

Not using MindTouch 4? Click here to view our MindTouch TCS training material
http://help.mindtouch.us/01MindTouch...h_TCS_Training

Center for Machine Learning and Intelligent Systems

Source: http://cml.ics.uci.edu/

Welcome to the Center for Machine Learning and Intelligent Systems at UC Irvine. Research in the fields of machine learning and intelligent systems addresses the fundamental problem of developing computer algorithms that can harness the vast amounts of digital data available in the 21st century and then use this data in an intelligent way to solve a variety of real-world problems. Examples of research activities in the Center range across areas as different as web search engines, statistical text mining, spam email filtering, information retrieval, automated reasoning, image and video data analysis, sensor networks, astronomy and planetary sciences, ocean and atmospheric sciences, systems biology, medical diagnosis, chemical informatics, and microarray genomics.

Research projects in the Center use theories and techniques from the intersection of computer science, statistics, and mathematics, including foundational ideas from algorithms, data structures, artificial intelligence, and databases (computer science), multivariate data analysis, Bayesian estimation, and computational statistics (from statistics), and optimization and probability theory (from mathematics). Research in the Center also has a very strong interdisciplinary component, including collaborations in areas ranging from sensors and ubiquitous computing, to databases and computer vision, to software engineering and Web applications. And projects involving collaborations with scientists are also numerous, including automated analysis of brain images, analysis of microarray gene expression data with microarrays, tracking storms in satellite data of the Earth's oceans, and many more.

We hope you find the Center's web site useful and encourage you to explore its contents.

About: http://archive.ics.uci.edu/ml/about.html

The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited "papers" in all of computer science. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged.

Many people deserve thanks for making the repository a success. Foremost among them are the donors and creators of the databases and data generators. Special thanks should also go to the past librarians of the repository: David Aha, Patrick Murphy, Christopher Merz, Eamonn Keogh, Cathy Blake, Seth Hettich, and David Newman.

Citation Policy: http://archive.ics.uci.edu/ml/citation_policy.html

Donate a Data Set: http://archive.ics.uci.edu/ml/donation_policy.html

Contact: http://archive.ics.uci.edu/ml/contact.html

Any questions about the site or a data set? Please direct all comments and inquiries to ml-repository '@' ics.uci.edu.

View ALL Data Sets: http://archive.ics.uci.edu/ml/datasets.html My Note: See Below

My Note: 270 data sets on February 10, 2014

Iris Data Set

Source: http://archive.ics.uci.edu/ml/datasets/Iris

Download: Data Folder, Data Set Description

Abstract

Famous database; from Fisher, 1936

Data Set Characteristics: Multivariate
Number of Instances: 150
Area: Life
Attribute Characteristics: Real
Number of Attributes: 4
Date Donated: 1988-07-01
Associated Tasks: Classification
Missing Values? No
Number of Web Hits: 524460

Source

Creator: R.A. Fisher

Donor: Michael Marshall (MARSHALL%PLU '@' io.arc.nasa.gov)

Data Set Information

This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. 

Predicted attribute: class of iris plant. 

This is an exceedingly simple domain. 

This data differs from the data presented in Fisher's article (identified by Steve Chadwick, spchadwick '@' espeedaz.net). The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature. The 38th sample should be: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and third features.
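The two corrections described above can be applied after loading the file. This is a minimal sketch, assuming the rows are held in file order as lists of four numbers plus a label. The function name `apply_fisher_corrections` and the placeholder rows are hypothetical; the corrected values are the ones quoted in the note.

```python
def apply_fisher_corrections(rows):
    """Return a copy of rows with the two documented corrections applied.

    rows is a list of [sepal_length, sepal_width, petal_length, petal_width, label]
    in file order; the sample numbers in the note are 1-based.
    """
    corrected = [list(r) for r in rows]
    corrected[35 - 1] = [4.9, 3.1, 1.5, 0.2, "Iris-setosa"]
    corrected[38 - 1] = [4.9, 3.6, 1.4, 0.1, "Iris-setosa"]
    return corrected


# Hypothetical placeholder rows standing in for the 150 samples of iris.data.
placeholder = [[0.0, 0.0, 0.0, 0.0, "Iris-setosa"] for _ in range(150)]
fixed = apply_fisher_corrections(placeholder)
print(fixed[34], fixed[37])
```

The function copies the rows rather than editing them in place, so the data as downloaded stays available for comparison.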

Attribute Information

1. sepal length in cm 
2. sepal width in cm 
3. petal length in cm 
4. petal width in cm 
5. class: 
-- Iris Setosa 
-- Iris Versicolour 
-- Iris Virginica

Relevant Papers

Fisher, R.A. "The use of multiple measurements in taxonomic problems". Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).

Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis. (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.

Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System Structure and Classification Rule for Recognition in Partially Exposed Environments". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-2, No. 1, 67-71.

Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433.

See also: 1988 MLC Proceedings, 54-64.

Citation Request

Please refer to the Machine Learning Repository's citation policy

Data Set Description

Source: http://archive.ics.uci.edu/ml/machin...ris/iris.names

1. Title: Iris Plants Database
	Updated Sept 21 by C.Blake - Added discrepancy information

2. Sources:
     (a) Creator: R.A. Fisher
     (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
     (c) Date: July, 1988

3. Past Usage:
   - Publications: too many to mention!!!  Here are a few.
   1. Fisher,R.A. "The use of multiple measurements in taxonomic problems"
      Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions
      to Mathematical Statistics" (John Wiley, NY, 1950).
   2. Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
      (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   3. Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
      Structure and Classification Rule for Recognition in Partially Exposed
      Environments".  IEEE Transactions on Pattern Analysis and Machine
      Intelligence, Vol. PAMI-2, No. 1, 67-71.
      -- Results:
         -- very low misclassification rates (0% for the setosa class)
   4. Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE 
      Transactions on Information Theory, May 1972, 431-433.
      -- Results:
         -- very low misclassification rates again
   5. See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al's AUTOCLASS II
      conceptual clustering system finds 3 classes in the data.

4. Relevant Information:
   --- This is perhaps the best known database to be found in the pattern
       recognition literature.  Fisher's paper is a classic in the field
       and is referenced frequently to this day.  (See Duda & Hart, for
       example.)  The data set contains 3 classes of 50 instances each,
       where each class refers to a type of iris plant.  One class is
       linearly separable from the other 2; the latter are NOT linearly
       separable from each other.
   --- Predicted attribute: class of iris plant.
   --- This is an exceedingly simple domain.
   --- This data differs from the data presented in Fisher's article
	(identified by Steve Chadwick,  spchadwick@espeedaz.net )
	The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa"
	where the error is in the fourth feature.
	The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa"
	where the errors are in the second and third features.  
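The two corrections listed above can be applied to a local copy of iris.data before analysis. The following is a minimal sketch, assuming the file has been read into a list of unquoted comma-separated lines in their original order; the function name `apply_corrections` is hypothetical and not part of the repository.

```python
from typing import List

# Corrected records as quoted in the description (1-based sample numbers).
# Assumption: the data file stores the class label without quotes.
CORRECTIONS = {
    35: "4.9,3.1,1.5,0.2,Iris-setosa",
    38: "4.9,3.6,1.4,0.1,Iris-setosa",
}


def apply_corrections(lines: List[str]) -> List[str]:
    """Return a copy of the data lines with the documented samples replaced."""
    fixed = list(lines)
    for sample_number, record in CORRECTIONS.items():
        index = sample_number - 1  # convert 1-based sample number to list index
        if index < len(fixed):
            fixed[index] = record
    return fixed
```

The input list is left untouched, so the original and corrected versions can be compared side by side.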

5. Number of Instances: 150 (50 in each of three classes)

6. Number of Attributes: 4 numeric, predictive attributes and the class

7. Attribute Information:
   1. sepal length in cm
   2. sepal width in cm
   3. petal length in cm
   4. petal width in cm
   5. class: 
      -- Iris Setosa
      -- Iris Versicolour
      -- Iris Virginica

8. Missing Attribute Values: None

Summary Statistics:
                 Min  Max   Mean    SD   Class Correlation
   sepal length: 4.3  7.9   5.84  0.83    0.7826   
    sepal width: 2.0  4.4   3.05  0.43   -0.4194
   petal length: 1.0  6.9   3.76  1.76    0.9490  (high!)
    petal width: 0.1  2.5   1.20  0.76    0.9565  (high!)

9. Class Distribution: 33.3% for each of 3 classes.
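The Min, Max, Mean, and SD columns of the Summary Statistics table above can be recomputed from the data file. This is a sketch under stated assumptions: each line holds four numeric fields followed by the class label, and SD is taken to be the sample standard deviation (the description does not say which form was used). The names `FEATURES` and `summarize` are illustrative only.

```python
import statistics
from typing import Dict, List

# Feature order as given under "Attribute Information".
FEATURES = ["sepal length", "sepal width", "petal length", "petal width"]


def summarize(lines: List[str]) -> Dict[str, Dict[str, float]]:
    """Compute min, max, mean and sample SD for the four numeric features."""
    columns: Dict[str, List[float]] = {name: [] for name in FEATURES}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines such as a trailing newline
        fields = line.split(",")
        for name, value in zip(FEATURES, fields[:4]):
            columns[name].append(float(value))
    # statistics.stdev needs at least two records per feature.
    return {
        name: {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values),
        }
        for name, values in columns.items()
    }
```

Running `summarize` on the full file and rounding to two decimals should allow a direct comparison with the table; the class correlation column is not reproduced here.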

Folder

Source: http://archive.ics.uci.edu/ml/machin...atabases/iris/

Index

Source: http://archive.ics.uci.edu/ml/machin...ses/iris/Index

03-Dec-1996 04:01    105

Index of iris

02 Dec 1996      105 Index
08 Mar 1993     4551 iris.data
30 May 1989     2604 iris.names
bezdekIris.data

Source: http://archive.ics.uci.edu/ml/machin...ezdekIris.data

14-Dec-1999 12:12    4.4K

My Note: Saved as text file for import to Spotfire

iris.data

Source: http://archive.ics.uci.edu/ml/machin...iris/iris.data

08-Mar-1993 16:27    4.4K

My Note: Saved as text file for import to Spotfire

iris.names

Source: http://archive.ics.uci.edu/ml/machin...ris/iris.names

11-Jul-2000 21:30    2.9K

My Note: Same as Data Set Description Above

Mushroom Data Set

Source: http://archive.ics.uci.edu/ml/datasets/Mushroom

Mushroom Data Set

Download: Data Folder, Data Set Description

Abstract

From Audubon Society Field Guide; mushrooms described in terms of physical characteristics; classification: poisonous or edible

Data Set Characteristics: Multivariate
Number of Instances: 8124
Area: Life
Attribute Characteristics: Categorical
Number of Attributes: 22
Date Donated: 1987-04-27
Associated Tasks: Classification
Missing Values?: Yes
Number of Web Hits: 112323

Source

Origin: Mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf 

Donor: Jeff Schlimmer (Jeffrey.Schlimmer '@' a.gp.cs.cmu.edu)

Data Set Information

This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

Attribute Information

1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s 
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y 
4. bruises?: bruises=t,no=f 
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s 
6. gill-attachment: attached=a,descending=d,free=f,notched=n 
7. gill-spacing: close=c,crowded=w,distant=d 
8. gill-size: broad=b,narrow=n 
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y 
10. stalk-shape: enlarging=e,tapering=t 
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? 
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
16. veil-type: partial=p,universal=u 
17. veil-color: brown=n,orange=o,white=w,yellow=y 
18. ring-number: none=n,one=o,two=t 
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z 
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y 
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y 
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d

Relevant Papers

Schlimmer,J.S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19). Doctoral dissertation, Department of Information and Computer Science, University of California, Irvine. 
[Web Link] 

Iba,W., Wogulis,J., & Langley,P. (1988). Trading off Simplicity and Coverage in Incremental Concept Learning. In Proceedings of the 5th International Conference on Machine Learning, 73-79. Ann Arbor, Michigan: Morgan Kaufmann. 
[Web Link] 

Duch W, Adamczak R, Grabczewski K (1996) Extraction of logical rules from training data using backpropagation networks, in: Proc. of the 1st Online Workshop on Soft Computing, 19-30.Aug.1996, pp. 25-30. 
[Web Link] 

Duch W, Adamczak R, Grabczewski K, Ishikawa M, Ueda H, Extraction of crisp logical rules using constrained backpropagation networks - comparison of two new approaches, in: Proc. of the European Symposium on Artificial Neural Networks (ESANN'97), Bruges, Belgium 16-18.4.1997. 
[Web Link]

Papers That Cite This Data Set [1]

Stanley Robson de Medeiros Oliveira. Data Transformation For Privacy-Preserving Data Mining. Doctor of Philosophy thesis, University of Alberta. 2005. [View Context].

Hyunsoo Kim and Se Hyun Park. Data Reduction in Support Vector Machines by a Kernelized Ionic Interaction Model. SDM. 2004. [View Context].

Xiaoyong Chai and Li Deng and Qiang Yang and Charles X. Ling. Test-Cost Sensitive Naive Bayes Classification. ICDM. 2004. [View Context].

Daniel J. Lizotte and Omid Madani and Russell Greiner. Budgeted Learning of Naive-Bayes Classifiers. UAI. 2003. [View Context].

Daniel Barbará and Yi Li and Julia Couto. COOLCAT: an entropy-based algorithm for categorical clustering. CIKM. 2002. [View Context].

Stephen D. Bay and Michael J. Pazzani. Detecting Group Differences: Mining Contrast Sets. Data Min. Knowl. Discov, 5. 2001. [View Context].

Jinyan Li and Guozhu Dong and Kotagiri Ramamohanarao and Limsoon Wong. DeEPs: A New Instance-based Discovery and Classification System. Proceedings of the Fourth European Conference on Principles and Practice of Knowledge Discovery in Databases. 2001. [View Context].

Huan Liu and Hongjun Lu and Jie Yao. Toward Multidatabase Mining: Identifying Relevant Databases. IEEE Trans. Knowl. Data Eng, 13. 2001. [View Context].

Jinyan Li and Guozhu Dong and Kotagiri Ramamohanarao. Instance-Based Classification by Emerging Patterns. PKDD. 2000. [View Context].

Farhad Hussain and Huan Liu and Einoshin Suzuki and Hongjun Lu. Exception Rule Mining with a Relative Interestingness Measure. PAKDD. 2000. [View Context].

Kiri Wagstaff and Claire Cardie. Clustering with Instance-level Constraints. ICML. 2000. [View Context].

Mark A. Hall and Lloyd A. Smith. Feature Selection for Machine Learning: Comparing a Correlation-Based Filter Approach to the Wrapper. FLAIRS Conference. 1999. [View Context].

Jinyan Li and Xiuzhen Zhang and Guozhu Dong and Kotagiri Ramamohanarao and Qun Sun. Efficient Mining of High Confidence Association Rules without Support Thresholds. PKDD. 1999. [View Context].

Seth Bullock and Peter M. Todd. Made to Measure: Ecological Rationality in Structured Environments. Center for Adaptive Behavior and Cognition Max Planck Institute for Human Development. 1999. [View Context].

Venkatesh Ganti and Johannes Gehrke and Raghu Ramakrishnan. CACTUS - Clustering Categorical Data Using Summaries. KDD. 1999. [View Context].

Ismail Taha and Joydeep Ghosh. Symbolic Interpretation of Artificial Neural Networks. IEEE Trans. Knowl. Data Eng, 11. 1999. [View Context].

Mark A. Hall. Correlation-based Feature Selection for Machine Learning. Doctor of Philosophy thesis, Department of Computer Science, The University of Waikato, Hamilton, New Zealand. 1999. [View Context].

Huan Liu and Hongjun Lu and Ling Feng and Farhad Hussain. Efficient Search of Reliable Exceptions. PAKDD. 1999. [View Context].

Huan Liu and Rudy Setiono. Incremental Feature Selection. Appl. Intell, 9. 1998. [View Context].

Robert M French. Pseudo-recurrent connectionist networks: An approach to the "sensitivity-stability" dilemma. Connection Science. 1997. [View Context].

Nicholas Howe and Claire Cardie. Examining Locally Varying Weights for Nearest Neighbor Algorithms. ICCBR. 1997. [View Context].

Huan Liu and Rudy Setiono. A Probabilistic Approach to Feature Selection - A Filter Solution. ICML. 1996. [View Context].

Kamal Ali and Michael J. Pazzani. Error Reduction through Learning Multiple Descriptions. Machine Learning, 24. 1996. [View Context].

Guszti Bartfai. VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui. Department of Computer Science PO Box 600. 1996. [View Context].

Geoffrey I. Webb. OPUS: An Efficient Admissible Algorithm for Unordered Search. J. Artif. Intell. Res. (JAIR), 3. 1995. [View Context].

Chotirat Ann and Dimitrios Gunopulos. Scaling up the Naive Bayesian Classifier: Using Decision Trees for Feature Selection. Computer Science Department University of California. [View Context].

Eric P. Kasten and Philip K. McKinley. MESO: Perceptual Memory to Support Online Learning in Adaptive Software. Proceedings of the Third International Conference on Development and Learning (ICDL). [View Context].

Stefan Rüping. A Simple Method For Estimating Conditional Probabilities For SVMs. CS Department, AI Unit Dortmund University. [View Context].

Josep Roure Alcobe. Incremental Hill-Climbing Search Applied to Bayesian Network Structure Learning. Escola Universitària Politècnica de Mataró. [View Context].

Wlodzislaw Duch and Rafal Adamczak and Krzysztof Grabczewski and Grzegorz Zal. A hybrid method for extraction of logical rules from data. Department of Computer Methods, Nicholas Copernicus University. [View Context].

Jinyan Li and Kotagiri Ramamohanarao and Guozhu Dong. ICML2000 The Space of Jumping Emerging Patterns and Its Incremental Maintenance Algorithms. Department of Computer Science and Software Engineering, The University of Melbourne, Parkville. [View Context].

Wlodzislaw Duch and Rafal Adamczak and Krzysztof Grabczewski. Extraction of crisp logical rules using constrained backpropagation networks. Department of Computer Methods, Nicholas Copernicus University. [View Context].

Wlodzislaw Duch and Rudy Setiono and Jacek M. Zurada. Computational intelligence methods for rule-based data understanding. [View Context].

C. Titus Brown and Harry W. Bullen and Sean P. Kelly and Robert K. Xiao and Steven G. Satterfield and John G. Hagedorn and Judith E. Devaney. Visualization and Data Mining in an 3D Immersive Environment: Summer Project 2003. [View Context].

Daniel J. Lizotte. Budgeted Learning of Naive Bayes Classifiers. [View Context].

David R. Musicant. DATA MINING VIA MATHEMATICAL PROGRAMMING AND MACHINE LEARNING. Doctor of Philosophy (Computer Sciences) UNIVERSITY. [View Context].

Sherrie L. W and Zijian Zheng. A BENCHMARK FOR CLASSIFIER LEARNING. Basser Department of Computer Science The University of Sydney. [View Context].

Anthony Robins and Marcus Frean. Learning and generalisation in a stable network. Computer Science, The University of Otago. [View Context].

Rudy Setiono. Extracting M-of-N Rules from Trained Neural Networks. School of Computing National University of Singapore. [View Context].

José L. Balcázar. Rules with Bounded Negations and the Coverage Inference Scheme. Dept. LSI, UPC. [View Context].

Mehmet Dalkilic and Arijit Sengupta. A Logic-theoretic classifier called Circle. School of Informatics Center for Genomics and BioInformatics Indiana University. [View Context].

Daniel J. Lizotte and Omid Madani and Russell Greiner. Budgeted Learning, Part II: The Naive-Bayes Case. Department of Computing Science University of Alberta. [View Context].

Ron Kohavi and Barry G. Becker and Dan Sommerfield. Improving Simple Bayes. Data Mining and Visualization Group Silicon Graphics, Inc. [View Context].

Wlodzislaw Duch and Rafal Adamczak and Krzysztof Grabczewski and Norbert Jankowski. Control and Cybernetics. Department of Computer Methods, Nicholas Copernicus University. [View Context].

Huan Liu. A Family of Efficient Rule Generators. Department of Information Systems and Computer Science National University of Singapore. [View Context].

Shi Zhong and Weiyu Tang and Taghi M. Khoshgoftaar. Boosted Noise Filters for Identifying Mislabeled Data. Department of Computer Science and Engineering Florida Atlantic University. [View Context].

Citation Request

Please refer to the Machine Learning Repository's citation policy


[1] Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info


Data Set Description

Source: http://archive.ics.uci.edu/ml/machin...-lepiota.names

1. Title: Mushroom Database

2. Sources: 
    (a) Mushroom records drawn from The Audubon Society Field Guide to North
        American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred
        A. Knopf
    (b) Donor: Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
    (c) Date: 27 April 1987

3. Past Usage:
    1. Schlimmer,J.S. (1987). Concept Acquisition Through Representational
       Adjustment (Technical Report 87-19).  Doctoral dissertation, Department
       of Information and Computer Science, University of California, Irvine.
       --- STAGGER: asymptoted to 95% classification accuracy after reviewing
           1000 instances.
    2. Iba,W., Wogulis,J., & Langley,P. (1988).  Trading off Simplicity
       and Coverage in Incremental Concept Learning. In Proceedings of 
       the 5th International Conference on Machine Learning, 73-79.
       Ann Arbor, Michigan: Morgan Kaufmann.  
       -- approximately the same results with their HILLARY algorithm    
    3. In the following references a set of rules (given below) were
	learned for this data set which may serve as a point of
	comparison for other researchers.

	Duch W, Adamczak R, Grabczewski K (1996) Extraction of logical rules
	from training data using backpropagation networks, in: Proc. of the
	1st Online Workshop on Soft Computing, 19-30.Aug.1996, pp. 25-30,
	available on-line at: http://www.bioele.nuee.nagoya-u.ac.jp/wsc1/

	Duch W, Adamczak R, Grabczewski K, Ishikawa M, Ueda H, Extraction of
	crisp logical rules using constrained backpropagation networks -
	comparison of two new approaches, in: Proc. of the European Symposium
	on Artificial Neural Networks (ESANN'97), Bruges, Belgium 16-18.4.1997,
	pp. xx-xx

	Wlodzislaw Duch, Department of Computer Methods, Nicholas Copernicus
	University, 87-100 Torun, Grudziadzka 5, Poland
	e-mail: duch@phys.uni.torun.pl
	WWW     http://www.phys.uni.torun.pl/kmk/
	
	Date: Mon, 17 Feb 1997 13:47:40 +0100
	From: Wlodzislaw Duch <duch@phys.uni.torun.pl>
	Organization: Dept. of Computer Methods, UMK

	I have attached a file containing logical rules for mushrooms.
	It should be helpful for other people since only in the last year I
	have seen about 10 papers analyzing this dataset and obtaining quite
	complex rules. We will try to contribute other results later.

	With best regards, Wlodek Duch
	________________________________________________________________

	Logical rules for the mushroom data sets.

	Logical rules given below seem to be the simplest possible for the
	mushroom dataset and therefore should be treated as benchmark results.

	Disjunctive rules for poisonous mushrooms, from most general
	to most specific:

	P_1) odor=NOT(almond.OR.anise.OR.none)
	     120 poisonous cases missed, 98.52% accuracy

	P_2) spore-print-color=green
	     48 cases missed, 99.41% accuracy
         
	P_3) odor=none.AND.stalk-surface-below-ring=scaly.AND.
	          (stalk-color-above-ring=NOT.brown) 
	     8 cases missed, 99.90% accuracy
         
	P_4) habitat=leaves.AND.cap-color=white
	         100% accuracy     

	Rule P_4) may also be

	P_4') population=clustered.AND.cap-color=white

	These rules involve 6 attributes (out of 22). Rules for edible
	mushrooms are obtained as negation of the rules given above, for
	example the rule:

	odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green

	gives 48 errors, or 99.41% accuracy on the whole dataset.

	Several slightly more complex variations on these rules exist,
	involving other attributes, such as gill_size, gill_spacing,
	stalk_surface_above_ring, but the rules given above are the simplest
	we have found.
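The disjunctive rules P_1 through P_4 above can be written as a single predicate over one mushroom record. This is a minimal sketch, assuming each record is a dictionary keyed by the attribute names from the Attribute Information list and holding the single-letter codes given there; `is_poisonous` and `accuracy_percent` are hypothetical helper names, not part of the data set distribution.

```python
from typing import Dict


def is_poisonous(m: Dict[str, str]) -> bool:
    """Apply rules P_1..P_4; a record is flagged if any rule fires."""
    # P_1: odor is NOT almond (a), anise (l) or none (n)
    p1 = m["odor"] not in {"a", "l", "n"}
    # P_2: spore-print-color is green (r)
    p2 = m["spore-print-color"] == "r"
    # P_3: odor none, stalk surface below ring scaly (y),
    #      stalk color above ring NOT brown (n)
    p3 = (
        m["odor"] == "n"
        and m["stalk-surface-below-ring"] == "y"
        and m["stalk-color-above-ring"] != "n"
    )
    # P_4: habitat leaves (l) and cap color white (w)
    p4 = m["habitat"] == "l" and m["cap-color"] == "w"
    return p1 or p2 or p3 or p4


def accuracy_percent(missed: int, total: int = 8124) -> float:
    """Accuracy implied by a number of missed cases out of all instances."""
    return 100.0 * (total - missed) / total
```

With 8124 instances, `accuracy_percent(120)`, `accuracy_percent(48)`, and `accuracy_percent(8)` round to the 98.52%, 99.41%, and 99.90% figures quoted with the rules.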


4. Relevant Information:
    This data set includes descriptions of hypothetical samples
    corresponding to 23 species of gilled mushrooms in the Agaricus and
    Lepiota Family (pp. 500-525).  Each species is identified as
    definitely edible, definitely poisonous, or of unknown edibility and
    not recommended.  This latter class was combined with the poisonous
    one.  The Guide clearly states that there is no simple rule for
    determining the edibility of a mushroom; no rule like ``leaflets
    three, let it be'' for Poisonous Oak and Ivy.

5. Number of Instances: 8124

6. Number of Attributes: 22 (all nominally valued)

7. Attribute Information: (classes: edible=e, poisonous=p)
     1. cap-shape:                bell=b,conical=c,convex=x,flat=f,
                                  knobbed=k,sunken=s
     2. cap-surface:              fibrous=f,grooves=g,scaly=y,smooth=s
     3. cap-color:                brown=n,buff=b,cinnamon=c,gray=g,green=r,
                                  pink=p,purple=u,red=e,white=w,yellow=y
     4. bruises?:                 bruises=t,no=f
     5. odor:                     almond=a,anise=l,creosote=c,fishy=y,foul=f,
                                  musty=m,none=n,pungent=p,spicy=s
     6. gill-attachment:          attached=a,descending=d,free=f,notched=n
     7. gill-spacing:             close=c,crowded=w,distant=d
     8. gill-size:                broad=b,narrow=n
     9. gill-color:               black=k,brown=n,buff=b,chocolate=h,gray=g,
                                  green=r,orange=o,pink=p,purple=u,red=e,
                                  white=w,yellow=y
    10. stalk-shape:              enlarging=e,tapering=t
    11. stalk-root:               bulbous=b,club=c,cup=u,equal=e,
                                  rhizomorphs=z,rooted=r,missing=?
    12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
    13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
    14. stalk-color-above-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o,
                                  pink=p,red=e,white=w,yellow=y
    15. stalk-color-below-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o,
                                  pink=p,red=e,white=w,yellow=y
    16. veil-type:                partial=p,universal=u
    17. veil-color:               brown=n,orange=o,white=w,yellow=y
    18. ring-number:              none=n,one=o,two=t
    19. ring-type:                cobwebby=c,evanescent=e,flaring=f,large=l,
                                  none=n,pendant=p,sheathing=s,zone=z
    20. spore-print-color:        black=k,brown=n,buff=b,chocolate=h,green=r,
                                  orange=o,purple=u,white=w,yellow=y
    21. population:               abundant=a,clustered=c,numerous=n,
                                  scattered=s,several=v,solitary=y
    22. habitat:                  grasses=g,leaves=l,meadows=m,paths=p,
                                  urban=u,waste=w,woods=d

8. Missing Attribute Values: 2480 of them (denoted by "?"), all for
   attribute #11.
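The attribute list and the missing-value note above suggest a simple way to load the data file and confirm the count of "?" values. The sketch below assumes that each line of agaricus-lepiota.data is comma-separated with the class code (e or p) as the first field followed by the 22 attributes in the listed order; that layout is an assumption, since the description does not state the column position of the class. The names `parse_record` and `count_missing` are illustrative.

```python
from typing import Dict, Iterable

# Attribute names in the order given under "Attribute Information".
ATTRIBUTES = [
    "cap-shape", "cap-surface", "cap-color", "bruises?", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]


def parse_record(line: str) -> Dict[str, str]:
    """Turn one comma-separated line into a dict keyed by attribute name."""
    fields = line.strip().split(",")
    if len(fields) != len(ATTRIBUTES) + 1:
        raise ValueError(f"expected {len(ATTRIBUTES) + 1} fields, got {len(fields)}")
    record = {"class": fields[0]}  # assumed: class code comes first
    record.update(zip(ATTRIBUTES, fields[1:]))
    return record


def count_missing(records: Iterable[Dict[str, str]], attribute: str = "stalk-root") -> int:
    """Count records whose value for the attribute is the missing marker '?'."""
    return sum(1 for r in records if r[attribute] == "?")
```

If the layout assumption holds, `count_missing` over the whole file should report the 2480 missing stalk-root values stated above.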

9. Class Distribution: 
    --    edible: 4208 (51.8%)
    -- poisonous: 3916 (48.2%)
    --     total: 8124 instances
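The class percentages above follow directly from the counts, and the same counts give the entropy of the class label, which is the starting point for the information gain calculation that this page sets out to illustrate. A short sketch, with hypothetical function names:

```python
import math
from typing import Dict


def class_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Percentage of instances in each class."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


def entropy(counts: Dict[str, int]) -> float:
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts.values())
    return -sum(
        (n / total) * math.log2(n / total)
        for n in counts.values()
        if n > 0  # empty classes contribute nothing
    )


mushroom_counts = {"edible": 4208, "poisonous": 3916}
shares = class_shares(mushroom_counts)   # about 51.8% and 48.2%
base_entropy = entropy(mushroom_counts)  # just under 1 bit
```

Because the two classes are nearly balanced, the entropy is close to its maximum of 1 bit, so any attribute that separates the classes well will show a large information gain.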

Data Folder

Index

Source: http://archive.ics.uci.edu/ml/machin...mushroom/Index

My Note: This looks like the history

03-Dec-1996 04:02    193

Index of mushroom

02 Dec 1996      193 Index
25 Jun 1990   111577 expanded.Z
26 Feb 1990     4167 agaricus-lepiota.names
30 May 1989      853 README
30 May 1989   373704 agaricus-lepiota.data

agaricus-lepiota.data

Source: http://archive.ics.uci.edu/ml/machin...s-lepiota.data

30-May-1989 13:49    365K

My Note: See spreadsheet or copy and paste into Spotfire!

agaricus-lepiota.names

Source: http://archive.ics.uci.edu/ml/machin...-lepiota.names

10-Sep-2008 12:44    6.7K

My Note: See Data Set Description Above

expanded.Z

Source: http://archive.ics.uci.edu/ml/machin...oom/expanded.Z

25-Jun-1990 17:52    109K

My Note: Caution - I opened this in NotePad but did not see anything I could use.
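The .Z extension normally indicates a file produced by the Unix compress utility, which would explain why a text editor shows nothing readable. As a quick check before hunting for a decompression tool, the first two bytes can be compared with the compress signature (0x1F 0x9D). The sketch below only detects the format; it does not decompress, and the function name is hypothetical.

```python
# Signature bytes written at the start of files created by Unix `compress`.
COMPRESS_MAGIC = b"\x1f\x9d"


def looks_like_unix_compress(header: bytes) -> bool:
    """Return True if the given leading bytes carry the compress signature."""
    return header[:2] == COMPRESS_MAGIC


# Example usage with a local copy of the file (path is an assumption):
# with open("expanded.Z", "rb") as handle:
#     print(looks_like_unix_compress(handle.read(2)))
```

If the check succeeds, a tool that understands this format (for example `uncompress` or `gzip -d` on Unix-like systems) is needed to obtain readable text.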

View All Data Sets

Source: http://archive.ics.uci.edu/ml/datasets.html

My Note: 270 data sets as of February 10, 2014

Name | Data Types | Default Task | Attribute Types | # Instances | # Attributes | Year
Abalone | Multivariate | Classification | Categorical, Integer, Real | 4177 | | 1995
Adult | Multivariate | Classification | Categorical, Integer | 48842 | 14 | 1996
Annealing | Multivariate | Classification | Categorical, Integer, Real | 798 | 38 |
Anonymous Microsoft Web Data | | Recommender-Systems | Categorical | 37711 | 294 | 1998
Arrhythmia | Multivariate | Classification | Categorical, Integer, Real | 452 | 279 | 1998
Artificial Characters | Multivariate | Classification | Categorical, Integer, Real | 6000 | | 1992
Audiology (Original) | Multivariate | Classification | Categorical | 226 | | 1987
Audiology (Standardized) | Multivariate | Classification | Categorical | 226 | 69 | 1992
Auto MPG | Multivariate | Regression | Categorical, Real | 398 | | 1993
Automobile | Multivariate | Regression | Categorical, Integer, Real | 205 | 26 | 1987
Badges | Univariate, Text | Classification | | 294 | | 1994
Balance Scale | Multivariate | Classification | Categorical | 625 | | 1994
Balloons | Multivariate | Classification | Categorical | 16 | |
Breast Cancer | Multivariate | Classification | Categorical | 286 | | 1988
Breast Cancer Wisconsin (Original) | Multivariate | Classification | Integer | 699 | 10 | 1992
Breast Cancer Wisconsin (Prognostic) | Multivariate | Classification, Regression | Real | 198 | 34 | 1995
Breast Cancer Wisconsin (Diagnostic) | Multivariate | Classification | Real | 569 | 32 | 1995
Pittsburgh Bridges | Multivariate | Classification | Categorical, Integer | 108 | 13 | 1990
Car Evaluation | Multivariate | Classification | Categorical | 1728 | | 1997
Census Income | Multivariate | Classification | Categorical, Integer | 48842 | 14 | 1996
Chess (King-Rook vs. King-Knight) | Multivariate, Data-Generator | Classification | Categorical, Integer | | 22 | 1988
Chess (King-Rook vs. King-Pawn) | Multivariate | Classification | Categorical | 3196 | 36 | 1989
Chess (King-Rook vs. King) | Multivariate | Classification | Categorical, Integer | 28056 | | 1994
Chess (Domain Theories) | Domain-Theory | | | | |
Bach Chorales | Univariate, Time-Series | | Categorical, Integer | 100 | |
Connect-4 | Multivariate, Spatial | Classification | Categorical | 67557 | 42 | 1995
Credit Approval | Multivariate | Classification | Categorical, Integer, Real | 690 | 15 |
Japanese Credit Screening | Multivariate, Domain-Theory | Classification | Categorical, Real, Integer | 125 | | 1992
Computer Hardware | Multivariate | Regression | Integer | 209 | | 1987
Contraceptive Method Choice | Multivariate | Classification | Categorical, Integer | 1473 | | 1997
Covertype | Multivariate | Classification | Categorical, Integer | 581012 | 54 | 1998
Cylinder Bands | Multivariate | Classification | Categorical, Integer, Real | 512 | 39 | 1995
Dermatology | Multivariate | Classification | Categorical, Integer | 366 | 33 | 1998
Diabetes | Multivariate, Time-Series | | Categorical, Integer | | 20 |
DGP2 - The Second Data Generation Program | Data-Generator | | Real | | |
Document Understanding | | | | | | 1994
EBL Domain Theories | | | | | |
Echocardiogram | Multivariate | Classification | Categorical, Integer, Real | 132 | 12 | 1989
Ecoli | Multivariate | Classification | Real | 336 | | 1996
Flags | Multivariate | Classification | Categorical, Integer | 194 | 30 | 1990
Function Finding | | Function-Learning | Real | 352 | | 1990
Glass Identification | Multivariate | Classification | Real | 214 | 10 | 1987
Haberman's Survival | Multivariate | Classification | Integer | 306 | | 1999
Hayes-Roth | Multivariate | Classification | Categorical | 160 | | 1989
Heart Disease | Multivariate | Classification | Categorical, Integer, Real | 303 | 75 | 1988
Hepatitis | Multivariate | Classification | Categorical, Integer, Real | 155 | 19 | 1988
Horse Colic | Multivariate | Classification | Categorical, Integer, Real | 368 | 27 | 1989
Housing | Multivariate | Regression | Categorical, Integer, Real | 506 | 14 | 1993
ICU | Multivariate, Time-Series | | Real | | |
Image Segmentation | Multivariate | Classification | Real | 2310 | 19 | 1990
Internet Advertisements | Multivariate | Classification | Categorical, Integer, Real | 3279 | 1558 | 1998
Ionosphere | Multivariate | Classification | Integer, Real | 351 | 34 | 1989
Iris | Multivariate | Classification | Real | 150 | | 1988
ISOLET | Multivariate | Classification | Real | 7797 | 617 | 1994
Kinship | Relational | Relational-Learning | Categorical | 104 | 12 | 1990
Labor Relations | Multivariate | | Categorical, Integer, Real | 57 | 16 | 1988
LED Display Domain | Multivariate, Data-Generator | Classification | Categorical | | | 1988
Lenses | Multivariate | Classification | Categorical | 24 | | 1990
Letter Recognition | Multivariate | Classification | Integer | 20000 | 16 | 1991
Liver Disorders | Multivariate | | Categorical, Integer, Real | 345 | | 1990
Logic Theorist | Domain-Theory | | | | |
Lung Cancer | Multivariate | Classification | Integer | 32 | 56 | 1992
Lymphography | Multivariate | Classification | Categorical | 148 | 18 | 1988
Mechanical Analysis | Multivariate | Classification | Categorical, Integer, Real | 209 | | 1990
Meta-data | Multivariate | Classification | Categorical, Integer, Real | 528 | 22 | 1996
Mobile Robots | Domain-Theory | | Categorical, Integer, Real | | | 1995
Molecular Biology (Promoter Gene Sequences) | Sequential, Domain-Theory | Classification | Categorical | 106 | 58 | 1990
Molecular Biology (Protein Secondary Structure) | Sequential | Classification | Categorical | 128 | |
Molecular Biology (Splice-junction Gene Sequences) | Sequential, Domain-Theory | Classification | Categorical | 3190 | 61 | 1992
MONK's Problems | Multivariate | Classification | Categorical | 432 | | 1992
Moral Reasoner | Domain-Theory | | | 202 | | 1994
Multiple Features | Multivariate | Classification | Integer, Real | 2000 | 649 |
Mushroom | Multivariate | Classification | Categorical | 8124 | 22 | 1987
Musk (Version 1) | Multivariate | Classification | Integer | 476 | 168 | 1994
Musk (Version 2) | Multivariate | Classification | Integer | 6598 | 168 | 1994
Nursery | Multivariate | Classification | Categorical | 12960 | | 1997
Othello Domain Theory | Domain-Theory | | | | | 1991
Page Blocks Classification | Multivariate | Classification | Integer, Real | 5473 | 10 | 1995
Pima Indians Diabetes | Multivariate | Classification | Integer, Real | 768 | | 1990
Optical Recognition of Handwritten Digits | Multivariate | Classification | Integer | 5620 | 64 | 1998
Pen-Based Recognition of Handwritten Digits | Multivariate | Classification | Integer | 10992 | 16 | 1998
Post-Operative Patient | Multivariate | Classification | Categorical, Integer | 90 | | 1993
Primary Tumor | Multivariate | Classification | Categorical | 339 | 17 | 1988
Prodigy | Domain-Theory | | | | |
Qualitative Structure Activity Relationships | Domain-Theory | | | | |
Quadruped Mammals | Multivariate, Data-Generator | Classification | Real | | 72 | 1992
Servo | Multivariate | Regression | Categorical, Integer | 167 | | 1993
Shuttle Landing Control | Multivariate | Classification | Categorical | 15 | | 1988
Solar Flare | Multivariate | Regression | Categorical | 1389 | 10 | 1989
Soybean (Large) | Multivariate | Classification | Categorical | 307 | 35 | 1988
Soybean (Small) | Multivariate | Classification | Categorical | 47 | 35 | 1987
Challenger USA Space Shuttle O-Ring | Multivariate | Regression | Integer | 23 | | 1993
Low Resolution Spectrometer | Multivariate | Classification | Integer, Real | 531 | 102 | 1988
Spambase | Multivariate | Classification | Integer, Real | 4601 | 57 | 1999
SPECT Heart | Multivariate | Classification | Categorical | 267 | 22 | 2001
SPECTF Heart | Multivariate | Classification | Integer | 267 | 44 | 2001
Sponge | Multivariate | Clustering | Categorical, Integer | 76 | 45 |
Statlog Project | | | | | | 1992
Student Loan Relational | Domain-Theory | | | 1000 | | 1993
Teaching Assistant Evaluation | Multivariate | Classification | Categorical, Integer | 151 | | 1997
Tic-Tac-Toe Endgame | Multivariate | Classification | Categorical | 958 | | 1991
Thyroid Disease | Multivariate, Domain-Theory | Classification | Categorical, Real | 7200 | 21 | 1987
Trains | Multivariate | Classification | Categorical | 10 | 32 | 1994
University | Multivariate | Classification | Categorical, Integer | 285 | 17 | 1988
Congressional Voting Records | Multivariate | Classification | Categorical | 435 | 16 | 1987
Water Treatment Plant | Multivariate | Clustering | Integer, Real | 527 | 38 | 1993
Waveform Database Generator (Version 1) | Multivariate, Data-Generator | Classification | Real | 5000 | 21 | 1988
Waveform Database Generator (Version 2) | Multivariate, Data-Generator | Classification | Real | 5000 | 40 | 1988
Wine | Multivariate | Classification | Integer, Real | 178 | 13 | 1991
Yeast | Multivariate | Classification | Real | 1484 | | 1996
Zoo | Multivariate | Classification | Categorical, Integer | 101 | 17 | 1990
Undocumented | | | | | |
Twenty Newsgroups | Text | | | 20000 | | 1999
Australian Sign Language signs | Multivariate, Time-Series | Classification | Categorical, Real | 6650 | 15 | 1999
Australian Sign Language signs (High Quality) | Multivariate, Time-Series | Classification | Real | 2565 | 22 | 2002
US Census Data (1990) | Multivariate | Clustering | Categorical | 2458285 | 68 |
Census-Income (KDD) | Multivariate | Classification | Categorical, Integer | 299285 | 40 | 2000
Coil 1999 Competition Data | Multivariate | | Categorical, Real | 340 | 17 | 1999
Corel Image Features | Multivariate | | Real | 68040 | 89 | 1999
E. Coli Genes | Relational | | | | | 2001
EEG Database | Multivariate, Time-Series | | Categorical, Integer, Real | 122 | | 1999
El Nino | Spatio-temporal | | Integer, Real | 178080 | 12 | 1999
Entree Chicago Recommendation Data | Transactional, Sequential | Recommender-Systems | Categorical | 50672 | | 2000
CMU Face Images | Image | Classification | Integer | 640 | | 1999
Insurance Company Benchmark (COIL 2000) | Multivariate | Regression, Description | Categorical, Integer | 9000 | 86 | 2000
Internet Usage Data | Multivariate | | Categorical, Integer | 10104 | 72 | 1999
IPUMS Census Database | Multivariate | | Categorical, Integer | 256932 | 61 | 1999
Japanese Vowels | Multivariate, Time-Series | Classification | Real | 640 | 12 |
KDD Cup 1998 Data | Multivariate | Regression | Categorical, Integer | 191779 | 481 | 1998
KDD Cup 1999 Data | Multivariate | Classification | Categorical, Integer | 4000000 | 42 | 1999
M. Tuberculosis Genes | Relational | | | | | 2001
Movie | Multivariate, Relational | | | 10000 | | 1999
MSNBC.com Anonymous Web Data | Sequential | | Categorical | 989818 | |
NSF Research Award Abstracts 1990-2003 | Text | | | 129000 | | 2003
Pioneer-1 Mobile Robot Data | Multivariate, Time-Series | | Categorical, Real | | | 1999
Pseudo Periodic Synthetic Time Series | Univariate, Time-Series | | | 100000 | | 1999
Reuters-21578 Text Categorization Collection | Text | Classification | Categorical | 21578 | | 1997
Robot Execution Failures | Multivariate, Time-Series | Classification | Integer | 463 | 90 | 1999
Synthetic Control Chart Time Series | Time-Series | Classification, Clustering | Real | 600 | | 1999
Syskill and Webert Web Page Ratings | Multivariate, Text | Classification | Categorical | 332 | | 1998
UNIX User Data | Text, Sequential | | | | |

Volcanoes on Venus - JARtool experiment

Image 

Classification 

 

 

 

 

 

Statlog (Australian Credit Approval)

Multivariate 

Classification 

Categorical, Integer, Real 

690 

14 

 

 

Statlog (German Credit Data)

Multivariate 

Classification 

Categorical, Integer 

1000 

20 

1994 

 

Statlog (Heart)

Multivariate 

Classification 

Categorical, Real 

270 

13 

 

 

Statlog (Landsat Satellite)

Multivariate 

Classification 

Integer 

6435 

36 

1993 

 

Statlog (Image Segmentation)

Multivariate 

Classification 

Real 

2310 

19 

1990 

 

Statlog (Shuttle)

Multivariate 

Classification 

Integer 

58000 

 

 

Statlog (Vehicle Silhouettes)

Multivariate 

Classification 

Integer 

946 

18 

 

 

Connectionist Bench (Nettalk Corpus)

Multivariate 

 

Categorical 

20008 

 

 

Connectionist Bench (Sonar, Mines vs. Rocks)

Multivariate 

Classification 

Real 

208 

60 

 

 

Connectionist Bench (Vowel Recognition - Deterding Data)

 

Classification 

Real 

528 

10 

 

 

Economic Sanctions

Domain-Theory 

 

 

 

 

 

 

Protein Data

 

 

 

 

 

 

 

Cloud

Multivariate 

 

Real 

1024 

10 

1989 

 

CalIt2 Building People Counts

Multivariate, Time-Series 

 

Categorical, Integer 

10080 

2006 

 

Dodgers Loop Sensor

Multivariate, Time-Series 

 

Categorical, Integer 

50400 

2006 

 

Poker Hand

Multivariate 

Classification 

Categorical, Integer 

1025010 

11 

2007 

 

MAGIC Gamma Telescope

Multivariate 

Classification 

Real 

19020 

11 

2007 

 

UJI Pen Characters

Multivariate, Sequential 

Classification 

Integer 

1364 

 

2007 

 

Mammographic Mass

Multivariate 

Classification 

Integer 

961 

2007 

 

Forest Fires

Multivariate 

Regression 

Real 

517 

13 

2008 

 

Reuters Transcribed Subset

Text 

Classification 

 

200 

 

2008 

 

Bag of Words

Text 

Clustering 

Integer 

8000000 

100000 

2008 

 

Concrete Compressive Strength

Multivariate 

Regression 

Real 

1030 

2007 

 

Hill-Valley

Sequential 

Classification 

Real 

606 

101 

2008 

 

Arcene

Multivariate 

Classification 

Real 

900 

10000 

2008 

 

Dexter

Multivariate 

Classification 

Integer 

2600 

20000 

2008 

 

Dorothea

Multivariate 

Classification 

Integer 

1950 

100000 

2008 

 

Gisette

Multivariate 

Classification 

Integer 

13500 

5000 

2008 

 

Madelon

Multivariate 

Classification 

Real 

4400 

500 

2008 

 

Ozone Level Detection

Multivariate, Sequential, Time-Series 

Classification 

Real 

2536 

73 

2008 

 

Abscisic Acid Signaling Network

Multivariate 

Causal-Discovery 

Integer 

300 

43 

2008 

 

Parkinsons

Multivariate 

Classification 

Real 

197 

23 

2008 

 

Character Trajectories

Time-Series 

Classification, Clustering 

Real 

2858 

2008 

 

Blood Transfusion Service Center

Multivariate 

Classification 

Real 

748 

2008 

 

UJI Pen Characters (Version 2)

Multivariate, Sequential 

Classification 

Integer 

11640 

 

2009 

 

Semeion Handwritten Digit

Multivariate 

Classification 

Integer 

1593 

256 

2008 

 

SECOM

Multivariate 

Classification, Causal-Discovery 

Real 

1567 

591 

2008 

 

Plants

Multivariate 

Clustering 

Categorical 

22632 

70 

2008 

 

Libras Movement

Multivariate, Sequential 

Classification, Clustering 

Real 

360 

91 

2009 

 

Concrete Slump Test

Multivariate 

Regression 

Real 

103 

10 

2009 

 

Communities and Crime

Multivariate 

Regression 

Real 

1994 

128 

2009 

 

Acute Inflammations

Multivariate 

Classification 

Categorical, Integer 

120 

2009 

 

Wine Quality

Multivariate 

Classification, Regression 

Real 

4898 

12 

2009 

 

URL Reputation

Multivariate, Time-Series 

Classification 

Integer, Real 

2396130 

3231961 

2009 

 

p53 Mutants

Multivariate 

Classification 

Real 

16772 

5409 

2010 

 

Parkinsons Telemonitoring

Multivariate 

Regression 

Integer, Real 

5875 

26 

2009 

 

Demospongiae

Multivariate 

Classification 

Integer 

503 

 

2010 

 

Opinosis Opinion ⁄ Review

Text 

 

 

51 

 

2010 

 

Breast Tissue

Multivariate 

Classification 

Real 

106 

10 

2010 

 

Cardiotocography

Multivariate 

Classification 

Real 

2126 

23 

2010 

 

Wall-Following Robot Navigation Data

Multivariate, Sequential 

Classification 

Real 

5456 

24 

2010 

 

Spoken Arabic Digit

Multivariate, Time-Series 

Classification 

Real 

8800 

13 

2010 

 

Localization Data for Person Activity

Univariate, Sequential, Time-Series 

Classification 

Real 

164860 

2010 

 

AutoUniv

Multivariate 

Classification 

Categorical, Integer, Real 

 

 

2010 

 

Steel Plates Faults

Multivariate 

Classification 

Integer, Real 

1941 

27 

2010 

 

MiniBooNE particle identification

Multivariate 

Classification 

Real 

130065 

50 

2010 

 

YearPredictionMSD

Multivariate 

Regression 

Real 

515345 

90 

2011 

 

PEMS-SF

Multivariate, Time-Series 

Classification 

Real 

440 

138672 

2011 

 

OpinRank Review Dataset

Text 

 

 

 

 

2011 

 

Relative location of CT slices on axial axis

Domain-Theory 

Regression 

Real 

53500 

386 

2011 

 

Online Handwritten Assamese Characters Dataset

Multivariate, Sequential 

Classification 

Integer 

8235 

 

2011 

 

PubChem Bioassay Data

Multivariate 

Classification 

Integer, Real 

 

 

2011 

 

Record Linkage Comparison Patterns

Multivariate 

Classification 

Real 

5749132 

12 

2011 

 

Communities and Crime Unnormalized

Multivariate 

Regression 

Real 

2215 

147 

2011 

 

Vertebral Column

Multivariate 

Classification 

Real 

310 

2011 

 

EMG Physical Action Data Set

Time-Series 

Classification 

Real 

10000 

2011 

 

Vicon Physical Action Data Set

Time-Series 

Classification 

Real 

3000 

27 

2011 

 

Amazon Commerce reviews set

Multivariate, Text, Domain-Theory 

Classification 

Real 

1500 

10000 

2011 

 

Amazon Access Samples

Time-Series, Domain-Theory 

Regression, Clustering, Causal-Discovery 

 

30000 

20000 

2011 

 

Reuter_50_50

Multivariate, Text, Domain-Theory 

Classification, Clustering 

Real 

2500 

10000 

2011 

 

Farm Ads

Text 

Classification 

 

4143 

54877 

2011 

 

DBWorld e-mails

Text 

Classification 

 

64 

4702 

2011 

 

KEGG Metabolic Relation Network (Directed)

Multivariate, Univariate, Text 

Classification, Regression, Clustering 

Integer, Real 

53414 

24 

2011 

 

KEGG Metabolic Reaction Network (Undirected)

Multivariate, Univariate, Text 

Classification, Regression, Clustering 

Integer, Real 

65554 

29 

2011 

 

Bank Marketing

Multivariate 

Classification 

Real 

45211 

17 

2012 

 

YouTube Comedy Slam Preference Data

Text 

Classification 

 

1138562 

2012 

 

Gas Sensor Array Drift Dataset

Multivariate 

Classification 

Real 

13910 

128 

2012 

 

ILPD (Indian Liver Patient Dataset)

Multivariate 

Classification 

Integer, Real 

583 

10 

2012 

 

OPPORTUNITY Activity Recognition

Multivariate, Time-Series 

Classification 

Real 

2551 

242 

2012 

 

Nomao

Univariate 

Classification 

Real 

34465 

120 

2012 

 

SMS Spam Collection

Multivariate, Text, Domain-Theory 

Classification, Clustering 

Real 

5574 

 

2012 

 

Skin Segmentation

Univariate 

Classification 

Real 

245057 

2012 

 

Planning Relax

Univariate 

Classification 

Real 

182 

13 

2012 

 

PAMAP2 Physical Activity Monitoring

Multivariate, Time-Series 

Classification 

Real 

3850505 

52 

2012 

 

Restaurant & consumer data

Multivariate 

 

 

138 

47 

2012 

 

CNAE-9

Multivariate, Text 

Classification 

Integer 

1080 

857 

2012 

 

Individual household electric power consumption

Multivariate, Time-Series 

Regression, Clustering 

Real 

2075259 

2012 

 

seeds

Multivariate 

Classification, Clustering 

Real 

210 

2012 

 

Northix

Multivariate, Univariate, Text 

Classification 

Integer, Real 

115 

200 

2012 

 

QtyT40I10D100K

Sequential 

 

Integer 

3960456 

2012 

 

Legal Case Reports

Text 

Classification 

 

 

 

2012 

 

Human Activity Recognition Using Smartphones

Multivariate, Time-Series 

Classification, Clustering 

 

10299 

561 

2012 

 

One-hundred plant species leaves data set

 

Classification 

Real 

1600 

64 

2012 

 

Energy efficiency

Multivariate 

Classification, Regression 

Integer, Real 

768 

2012 

 

Yacht Hydrodynamics

Multivariate 

Regression 

Real 

308 

2013 

 

Fertility

Multivariate 

Classification, Regression 

Real 

100 

10 

2013 

 

Daphnet Freezing of Gait

Multivariate, Time-Series 

Classification 

Real 

237 

2013 

 

3D Road Network (North Jutland, Denmark)

Sequential, Text 

Regression, Clustering 

Real 

434874 

2013 

 

ISTANBUL STOCK EXCHANGE

Multivariate, Univariate, Time-Series 

Classification, Regression 

Real 

536 

2013 

 

Buzz in social media

Time-Series 

Regression 

Integer, Real 

140000 

77 

2013 

 

First-order theorem proving

Multivariate 

Classification 

Real 

6118 

51 

2013 

 

Wearable Computing: Classification of Body Postures and Movements (PUC-Rio)

Sequential 

Classification 

Integer, Real 

165632 

18 

2013 

 

Gas sensor arrays in open sampling settings

Multivariate, Time-Series 

Classification 

Real 

18000 

1950000 

2013 

 

Climate Model Simulation Crashes

Multivariate 

Classification 

Real 

540 

18 

2013 

 

MicroMass

Multivariate 

Classification 

Real 

931 

1300 

2013 

 

QSAR biodegradation

Multivariate 

Classification 

Integer, Real 

1055 

41 

2013 

 

BLOGGER

Multivariate 

Classification 

 

100 

2013 

 

Daily and Sports Activities

Multivariate, Time-Series 

Classification, Clustering 

Real 

9120 

5625 

2013 

 

User Knowledge Modeling

Multivariate 

Classification, Clustering 

Integer 

403 

2013 

 

Reuters RCV1 RCV2 Multilingual, Multiview Text Categorization Test collection

Multivariate 

Classification 

Real 

111740 

 

2013 

 

NYSK

Multivariate, Sequential, Text 

Clustering 

 

10421 

2013 

 

Turkiye Student Evaluation

Multivariate 

Classification, Clustering 

 

5820 

33 

2013 

 

User Knowledge Modeling Data (Students' Knowledge Levels on DC Electrical Machines)

Multivariate 

Classification 

Real 

403 

2013 

 

EEG Eye State

Multivariate, Sequential, Time-Series 

Classification 

Integer, Real 

14980 

15 

2013 

 

Physicochemical Properties of Protein Tertiary Structure

Multivariate 

Regression 

Real 

45730 

2013 

 

seismic-bumps

Multivariate 

Classification 

Real 

2584 

19 

2013 

 

banknote authentication

Multivariate 

Classification 

Real 

1372 

2013 

 

USPTO Algorithm Challenge, run by NASA-Harvard Tournament Lab and TopCoder Problem: Pat

Domain-Theory 

Classification 

Integer 

306 

2013 

 

YouTube Multiview Video Games Dataset

Multivariate, Text 

Classification, Clustering 

Integer, Real 

120000 

1000000 

2013 

 

Gas Sensor Array Drift Dataset at Different Concentrations

Multivariate, Time-Series 

Classification, Regression, Clustering, Causa 

Real 

13910 

129 

2013 

 

Activities of Daily Living (ADLs) Recognition Using Binary Sensors

Multivariate, Sequential, Time-Series 

Classification, Clustering 

 

2747 

 

2013 

 

SkillCraft1 Master Table Dataset

Multivariate 

Regression 

Integer, Real 

3395 

20 

2013 

 

Weight Lifting Exercises monitored with Inertial Measurement Units

Multivariate 

Classification 

Real 

39242 

152 

2013 

 

SML2010

Multivariate, Sequential, Time-Series, Text 

Regression 

Real 

4137 

24 

2014 

 

Bike Sharing Dataset

Univariate 

Regression 

Integer, Real 

17389 

16 

2013 

 

Predict keywords activities in a online social media

Multivariate, Sequential, Time-Series 

 

Integer, Real 

51 

35 

2013 

 

Thoracic Surgery Data

Multivariate 

Classification 

Integer, Real 

470 

17 

2013 

 

EMG dataset in Lower Limb

Multivariate, Time-Series 

 

Real 

132 

2014 

Scotch Whiskies Data

Source: http://adn.biol.umontreal.ca/~numeri...ta/scotch.html

Pierre Legendre
F.-J. Lapointe
May 1996
Département de sciences biologiques
Université de Montréal

Dear lover of Scotch whiskies and numerical data analysis,

Here are the Scotch whisky data (109 distilleries) used in the following paper:

Lapointe, F.-J. & P. Legendre. 1994. A classification of pure malt Scotch whiskies. Applied Statistics 43: 237-257.

There are 5 data sets: color, nose, body, palate, and finish. The binary (0-1) descriptors are in the same order as on p. 239 of the paper. In any case, they are all assembled (with identifiers) in an Excel data base, also included.
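Because the descriptors are binary (0-1), two distilleries can be compared by counting the descriptors they share. The sketch below uses the Jaccard distance, which is one common choice for presence/absence data and is an assumption here, not necessarily the measure used in the paper. The two example vectors are made up for illustration and are not taken from the data set.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length 0/1 descriptor vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Descriptors present in both whiskies.
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    # Descriptors present in at least one whisky.
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    if either == 0:
        return 0.0  # no descriptor present in either vector: treat as identical
    return 1.0 - both / either


# Hypothetical 8-descriptor vectors (illustrative values only).
whisky_a = [1, 0, 1, 1, 0, 0, 0, 1]
whisky_b = [1, 0, 0, 1, 0, 1, 0, 1]

# 3 descriptors shared out of 5 present in at least one vector.
print(jaccard_distance(whisky_a, whisky_b))
```

Joint absences (0-0) are ignored by this measure, which may suit descriptor lists where most entries are 0 for any given whisky.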

We also send you the list of geographic coordinates of the distilleries, given as decimal degrees: longitude WEST, followed by latitude NORTH. In Scotland, one degree north is about 1.87 times as long as one degree west. To be complete, we are also including a matrix of geographic distances among distilleries, already computed and written out as an ASCII file.
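The relationship between the two degree lengths, and the kind of distance that such a matrix could hold, can be sketched as follows. This is a minimal illustration, assuming a spherical Earth with a mean radius of 6371 km and the haversine formula; the README does not say how the distributed distance matrix was actually computed. The function names and the latitude of 57.67 degrees are chosen here for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not from the README


def degree_length_ratio(lat_deg):
    """Length of one degree of latitude divided by one degree of longitude."""
    return 1.0 / math.cos(math.radians(lat_deg))


def haversine_km(lon_west_a, lat_north_a, lon_west_b, lat_north_b):
    """Great-circle distance in km for points given as (longitude WEST, latitude NORTH)."""
    # Negate west longitudes to obtain the usual east-positive convention.
    lon_a, lon_b = math.radians(-lon_west_a), math.radians(-lon_west_b)
    lat_a, lat_b = math.radians(lat_north_a), math.radians(lat_north_b)
    dlat = lat_b - lat_a
    dlon = lon_b - lon_a
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat_a) * math.cos(lat_b) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# At an illustrative latitude in northern Scotland the ratio is close to 1.87.
print(round(degree_length_ratio(57.67), 2))
```

The ratio depends only on latitude, so a single figure such as 1.87 is an approximation that holds for a narrow band of latitudes.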

There are two whiskies in the classification from the Springbank distillery. One pertains to the Islay group, the other to the Western group.

Please let us know of the analyses you have performed, especially if you intend to publish them.

Documents available for distribution include

  • Macintosh files

    • body(109x8) -- Body variable (109x8)
    • color(109x14) -- Color variable (109x14)
    • Dist geo./Scotch -- Geographic distance between distilleries (109x109)
    • Distillery coordinates
    • finish(109x19) -- Finish variable (109x19)
    • nose(109x12) -- Nose variable (109x12)
    • palate(109x15) -- Palate variable (109x15)
    • ReadMe-Scotch Data -- This file
    • Regions (109x3) 1 col. -- Regions variable, coded 1-2-3 (109x1)
    • Scotch (109x68) -- All the variables (109x68)
    • Scotch/XL3.0 -- Excel file of all the variables, including headers
  • DOS files

    • BODY.TXT -- Body variable (109x8)
    • COLOR.TXT -- Color variable (109x14)
    • DIST-GEO.TXT -- Geographic distance between distilleries (109x109)
    • DISTCOOR.TXT -- Distillery coordinates
    • FINISH.TXT -- Finish variable (109x19)
    • NOSE.TXT -- Nose variable (109x12)
    • PALATE.TXT -- Palate variable (109x15)
    • README.TXT -- This file
    • REGIONS.TXT -- Regions variable, coded 1-2-3 (109x1)
    • SCOTCH.TXT -- All the variables (109x68)
    • SCOTCH.XLS -- Excel file (Mac version 3.0) of all the variables, including headers
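The five block widths listed above (14, 12, 8, 15, and 19) add up to the 68 columns of the combined file. As a rough sketch, one row of the combined matrix could be split back into named blocks as shown below. The column order used here (color, nose, body, palate, finish) follows the order in which the README names the data sets and is an assumption about the combined file, as is the helper name `split_row`.

```python
# Block widths as given in the file list; the ordering is an assumption.
BLOCK_WIDTHS = {"color": 14, "nose": 12, "body": 8, "palate": 15, "finish": 19}


def split_row(row, widths=BLOCK_WIDTHS):
    """Split one combined row into named descriptor blocks, in dict order."""
    expected = sum(widths.values())
    if len(row) != expected:
        raise ValueError(f"expected {expected} values, got {len(row)}")
    blocks, start = {}, 0
    for name, width in widths.items():
        blocks[name] = row[start:start + width]
        start += width
    return blocks


# Sanity check: the per-block widths account for all 68 columns.
print(sum(BLOCK_WIDTHS.values()))
```

If the actual column order differs, only the order of the entries in `BLOCK_WIDTHS` would need to change.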
