##### Table of contents

- Story
- Slides
- TreeMiner Overview
- Predictive Analytics in Big Data Environments
- Introduction
- The Problem: Current Data Analytics Methods Stressed by Massive Data Growth
- Today’s Answers: Tactical Responses
- Insights to Action – Drivers, Enablers, and Blockers
- The Dimensions of the Problem
- Scalable Interoperable Solution Sets
- Our Solution
- Organize Data for Knowledge…
- Data Mining in Hadoop – Challenges
- Hadoop Platform for Big Data
- Streaming Analytics
- Treeminer Advantages
- Existing Hadoop Approaches
- Treeminer’s Scalable Approach
- Unified Approach
- Treeminer In-House Analytics
- Performance Increases as Nodes Increase
- Text Classification Performance
- Structured Data Performance
- vMiner Platform
- Recognition for Accuracy
- Summary

- Training Module – Basic Concepts
- Cover Page
- Data Mining
- Major Categories of Data Mining
- Popular Current Approaches
- Challenges
- What is Vertical Data Mining?
- Comparison
- The “Layers” of Vertical Data Mining
- Terms
- Clustering
- Classification
- Oblique Classifier – Key Concepts
- SPS Anomaly Detection – Key Concepts
- The Vertical Data Mining Process
- Artifacts – Extract Process
- Artifacts – Model Process

- TreeMiner Overview
- Spotfire Dashboard
- Research Notes
- Vertical Data Mining and p-Tree algebra
- The P-tree Algebra 1, 2
- ABSTRACT
- 1. INTRODUCTION
- 2. PEANO COUNT TREES (P-TREES)
- 3. P-TREE ALGEBRA
- 4. PROPERTIES OF P-TREES
- 5. IMPLEMENTATION ISSUES AND EXPERIMENTAL RESULTS
- 5.1 P-tree Header
- 5.2 Performance Analysis
- Figure 8. Comparison of file size for different bits of Band 1 & 2 of a TIFF image
- Figure 9. Comparison of file size for different bits of Band 3 & 4 of a SPOT image
- Figure 10. Comparison of file size for different bits of Band 5 & 6 of a TM image
- Figure 11. Comparison of time required to perform ANDing operation
- Figure 12. Average time required to perform ANDing operation

- 6. RELATED WORK
- 7. CONCLUSION
- 8. REFERENCES

- NEXT


## Story

**Data Science for Vertical Data Mining**

I have worked with a quad tree GIS software package and with the IRIS data (see Clustering and Iris Data Set). **Spotfire 5 uses column analytics.** **See Next Generation In-Memory Analysis**

I did a Google search for Vertical Data Mining and got Treeminer.com

Winner, 2006 KDD Cup Global Analytics Competition. Source **My Note: I have this in my Data Science Tutorial. See KDD Cup 2006: Pulmonary embolisms detection from image data. See Story for 2006 data.**

Treeminer is pioneering a new approach to data mining analytics that focuses on organizing data for speed of knowledge extraction. By organizing data in thin vertical strips, analytics become dramatically more efficient as datasets scale. Source

What is Vertical Data Mining?

Modern analytics systems work by sequentially considering each new data point, one by one. In the case of the bank, perhaps each data point represents a new loan application; in the case of the image analyst, perhaps it is a pixel in a satellite image. Each new data point requires a run of the analytics engine to process it. Unfortunately, if an image grows from one million pixels to two million pixels, you are doubling the amount of work required (and potentially quadrupling the execution time!). We call this point-driven classification - each point in the dataset requires a loop through an analytics engine. As a result, growing datasets cause analytics execution times to grow as well. Two million points, two million loops. Or more. This is the root of the scalability challenges for modern data mining systems.

Treeminer literally turns the problem on its side - we look at data not in rows, but in thin vertical strips. If we consider the examples above, a simple mortgage application may have a handful of fields, or attributes: name, social security number, credit score, property zip code, and value. If, instead of looping on each mortgage application to perform the analytics, you loop on the attributes, then you have a loop of five, no matter how many applications you have. Similarly, in the case of an image, each attribute is perhaps a color band - again, as images grow, the number of color bands does not. This is the basis for the scalability advantage of vertical data mining.
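As a concrete (hypothetical) sketch of the difference, the loop counts below use made-up mortgage fields; the point is only that the column loop stays at five no matter how many applications arrive:

```python
# A hypothetical sketch of the row-wise vs. column-wise distinction.
# Field names and values are illustrative, not Treeminer's actual format.

applications = [
    {"name": "A", "ssn": "111", "credit_score": 700, "zip": "20001", "value": 300_000},
    {"name": "B", "ssn": "222", "credit_score": 650, "zip": "20002", "value": 250_000},
    {"name": "C", "ssn": "333", "credit_score": 720, "zip": "20003", "value": 410_000},
]

# Row-wise ("point-driven"): one pass through the engine per application.
row_passes = sum(1 for _ in applications)           # grows with the dataset

# Column-wise ("vertical"): one pass per attribute, regardless of row count.
columns = {key: [app[key] for app in applications] for key in applications[0]}
column_passes = len(columns)                        # stays at 5

print(row_passes, column_passes)  # 3 rows vs. 5 attribute strips
```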

Vertical Algebra

In order for vertical data mining to work, the complex algebra that is needed to perform the actual analytics must be able to be calculated on columns rather than rows. It turns out that another special property of our thin vertical strips is that by combining them with simple logical operators, like AND, OR, XOR, NOT, etc., you can build an incredibly diverse and complex mathematics that can support very sophisticated data mining processes. Returning to the mortgage example above, we can apply these logical AND and OR operators to the five vertical attributes - that is, take each attribute, combine it with the others using logical operators, and get a result for the entire dataset.

By working over a data set vertically, you get the analytics results for the whole dataset at once for each class you are looking for.
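A minimal sketch of this idea, using plain Python integers as bit columns (one bit per record; the records and predicates are illustrative, not Treeminer's format):

```python
# Sketch: encoding a predicate over the whole dataset with one AND.
# Each attribute test becomes a bit column (one bit per record), and
# combining columns with AND/OR answers the query for every record at once.

records = [
    {"credit_score": 700, "value": 300_000},
    {"credit_score": 650, "value": 250_000},
    {"credit_score": 720, "value": 410_000},
]

def bit_column(rows, predicate):
    """Pack one bit per record into a single integer (record i -> bit i)."""
    bits = 0
    for i, row in enumerate(rows):
        if predicate(row):
            bits |= 1 << i
    return bits

good_credit = bit_column(records, lambda r: r["credit_score"] >= 680)
big_loan    = bit_column(records, lambda r: r["value"] >= 300_000)

# One logical AND resolves the compound test for every record simultaneously.
qualified = good_credit & big_loan
print([i for i in range(len(records)) if (qualified >> i) & 1])  # records 0 and 2
```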

Using our vertical algebra, Treeminer designs and develops in-house analytics methods built specifically to leverage the advantages of our vertical data organization.

Classification

Treeminer's classification algorithm, *Oblique*, operates in a similar fashion to many classification methods - a separating hyperplane is constructed, and points are identified to be on one side of the hyperplane or the other, determining predicted class membership. So if our algorithm is logically similar to alternative methods, why is it faster?

The key is our vertical algebra: all distance calculations (e.g., the distances of points from the hyperplane) can be performed across all datapoints in the dataset in a single logical operation, rather than once for each point in the dataset. A mask is created for the entire dataset indicating which side of the hyperplane each point is on. The result: not only significant speed advantages, but, more importantly, speed advantages that only grow bigger as the dataset gets larger.
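A rough sketch of the idea (my own illustration, not Treeminer's actual implementation): accumulate the dot product w·x one attribute strip at a time, then threshold once to get the side-of-hyperplane mask for every point:

```python
# Sketch (not Treeminer's actual code): computing which side of a
# hyperplane w.x = b every point falls on, as one pass per attribute
# strip rather than one pass per point.

points = [(1.0, 2.0), (3.0, 0.5), (0.2, 0.1)]   # illustrative 2-D data
w, b = (1.0, 1.0), 2.0                           # hyperplane x + y = 2

# Accumulate w.x column by column: one multiply-add per attribute strip.
scores = [0.0] * len(points)
for j, w_j in enumerate(w):
    column = [p[j] for p in points]              # one thin vertical strip
    scores = [s + w_j * c for s, c in zip(scores, column)]

# The result is a mask over the whole dataset at once.
mask = [s >= b for s in scores]
print(mask)  # [True, True, False]
```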

Anomaly Detection

Anomaly detection algorithms identify points in a dataset that are unlike the others. As such, they can indicate potential fraud (transactions that seem out of place), network intrusions (activity that does not follow a normal pattern), potential component failures (sensor readings that are not normal), or other similar applications. Using Treeminer's One-Class SPS method, anomalous behavior can be detected significantly faster, and as a result action can be taken more quickly than with existing methods.

Clustering

Clustering methods group like datapoints together to identify items in a dataset that are similar to each other. Using vertical methods, clusters can be isolated across all points in the dataset at once, rather than point by point. Again, the result is significant performance gains without sacrificing accuracy.

For more details on Treeminer's high performance, scalable analytics, please download our White Paper, *An Introduction to Vertical Data Mining*. Source **My Note: I requested the White Paper**

**MORE TO FOLLOW**

## Spotfire Dashboard

## Research Notes

## Vertical Data Mining and p-Tree algebra

### July 19, 2014

We already did well in some of the KDD challenges.

I am attaching some of the presentations of Treeminer and white papers/published papers

Parameter Optimized, Vertical, Nearest-Neighbor-Vote and Boundary-Based Classification

Fast Attribute-based Unsupervised and Supervised Table Clustering using P-Trees

K-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees

On Mining Satellite and Other Remotely Sensed Images

We have a very high degree of confidence in text predictive analytics and are now moving into more complex systems that involve text, image, M2M, and other data formats simultaneously, as our strength and innovation are based on pattern matching at the byte level. This technique is best leveraged in a system with complex and diverse data types and a large volume of data. Since we are detecting patterns at the byte level, p-Tree will generate some new intelligence that won't be captured by SVM or EML or DML, which are based on data interpretation. Besides, it won't need the data renormalization that other machine learning types require for the heterogeneous data types most commonly found in clinical and healthcare data.

Overall, it will be fun to see how, on the same system of data, Spotfire (assuming they are also doing SVM/NN) and p-Tree can generate some common intelligence as well as some different intelligence.

What is important here is to emphasize a new aspect of Big Data as a Mathematical Complex System, which will have many local and global patterns in different data dimensions within the same Big Data system, and not necessarily one kind of ML is going to capture it all. This will be a sharp difference from legacy data analysis.

And therefore any new fundamental approach of Machine learning promises a new paradigm of intelligence and analytic that may not be possible to capture by known forms of ML.

### July 19, 2014

So it sounds like you are doing both the backend and the front end like Spotfire, but without connectors to lots of data sets, unless I am missing something.

So I did a background story to get ready for a Meetup on Data Science for Vertical Data Mining.

Please see: http://semanticommunity.info/Data_Science/Data_Science_for_Vertical_Data_Mining

I suggest we use the KDD 2006 data set and the OpenFDA data set that Brook is working with, since you are in discussions with him as well for collaboration like I am.

### July 17, 2014

Yes, that is common. On top, we also do ( or will be doing) auto-clustering, learning based clustering, pattern detection etc.

Now since these are mathematically complex systems, they will have several pattern branches. Patterns generated from SVM, Neural Network or RF will not be the same as from p-Tree. p-Tree will give some new patterns and new information even if you are already doing fraud detection or clustering.

The value of a new machine learning algorithm like p-Tree is also in finding new "clues" that are not there in conventional SVM or NN or RF.

### July 16, 2014

Thank you, and Spotfire does both and more. Brand

### July 16, 2014

This sounds exciting. I will wait for the Spotfire server to be up. I will discuss this and see what we can turn around as a proposal and POC. What would be of interest?

1. Fraud detection?

2. Anomaly?

### July 16, 2014

Thank you.

1. What is the way to get those data? Please see what I did for their App Challenge to get their data: http://semanticommunity.info/AOL_Government/Department_of_Commerce_App_Challenge

Note: The Spotfire Server is down, but I have Spotfire screen captures.

*The article mentions new data sources I will find.*

*2. What is the roadmap to turn POC into a paid project for the Department of Commerce? I met their head of Digital Strategy, Mike Kruger, and we could re-connect with him when we have something to show and talk about. Brand*

*P.S. Brook Aker is getting the new OpenFDA data ready for Spotfire.*

### July 16, 2014

This is a brilliant idea. I am supposing their data will be mixed text and numbers.

What I could not get from the press release:

1. What is the way to get those data?

2. What is the roadmap to turn POC into a paid project for department of commerce?

FDA, FAA and NIH already released similar but paid projects on exploratory data analysis. Did department of commerce release something similar?

### July 16, 2014

Wonderful. I recommended this to the Commerce CIO at a recent White House Conference, so maybe we can do a pilot with Commerce/Census data. I did one about 2 years ago with Spotfire that got very favorable attention from them. Brand

**1. Commerce creates new 'chief data officer' post to help better leverage information**

By Dibya Sarkar

Commerce Secretary Penny Pritzker said July 14 that the department is hiring its first-ever chief data officer to help it "develop, test, and grow the next phase of the open data revolution." "Our chief data officer will pull together a platform for all of our data sets; oversee improvements in data collection and dissemination; and ensure our data programs are coordinated, comprehensive, and strategic," she said in prepared remarks. "Put simply, our chief data officer will hold the key to unlocking more of our government data." Commerce houses several agencies - such as the Census Bureau, Patent and Trademark Office, and the National Oceanic and Atmospheric Administration - which collect massive amounts of data.

Pritzker made the announcement at a conference hosted by Esri, which supplies geographic information system software and applications. She also made several other announcements during the conference. She said the International Trade Administration has launched a portal that will house diverse sets of trade and investment data in one place to make it easier for businesses to mine information. The department is also creating a data advisory council of private-sector leaders that will advise Commerce on how best to use government data. Its intent is to improve how data is used and shared to make businesses and governments more responsive, to better anticipate needs, and to develop new data products and services in collaboration with the private sector.

The secretary said Commerce has released a new report about the value of federal statistical programs, which help governments and businesses gain more value and investment, "with the potential to guide up to $3.3 trillion in investments" annually in the U.S. "Our report makes another essential point: the cost of government data is small relative to its benefits," she said. "The federal government spends roughly three cents per person, per day, collecting and disseminating statistical data. In other words, when it comes to data, taxpayers get tremendous bang for their buck."


### July 16, 2014

We enjoyed our discussion with you and see immense possibilities.

Yes, we definitely want to integrate our backend with the Tibco/Spotfire Geoanalytics front end. But I need to look out for a suitable project or POC that will turn into cash. BTW, Ovum is primarily Telco analytics.

There are 3 elements to any data science project - data, algorithm, and infrastructure. If all 3 are available from a community resource for POC or exploratory data analysis, it will be of immense benefit to everyone.

### July 16, 2014

Thanks for the excellent discussions. I hope we covered your 3 topics.

Be Informed: http://www.beinformed.com/BeInformed...ysts?init=true

And whatever else is on your list you will send me.

So a connector from Spotfire would enable me to do front end visualizations and analytics with your back end database.

My writing a data story about my experience with doing the above, and our organizing a Meetup to present it, would seem to be the way to start on these three topics.

Becoming the company that provides fast turn-around with data science and data technology would seem to be the longer-term win-win for both of us. Brand

### July 10, 2014

Okay, thank you, and sounds good! Brand

### July 9, 2014

I would like to touch base on 3 topics--

1. Exploratory federal data analysis - we have 3 partners with large Hadoop clusters (50-100 nodes) who are interested in such exploratory analysis. I would like to discuss what would be a formal way of setting up a service mechanism for running large Govt data using Treeminer + cluster.

2. How could we help in your present projects which involve Big Data predictive analytics?

3. How could we partner with you (provided you are satisfied with our technology) in different Govt projects?

### July 8, 2014

Dear Dr Niemann

It was a great pleasure to meet you at the Federal Big Data meet-up. I am also encouraged to hear Dr. Sam Madden emphasize the use of verticalized and graph-based databases for Big Data solutions, because at Treeminer for the last two years we have been developing exactly that, and now we have a viable product called p-Tree algebra which works on a verticalized database.

p-Algebra based machine learning in the vertical database approach is at least 10x to 100x faster than the horizontal approach, which uses traditional machine learning like SVM or Random Forest. Please see the attached presentations.

We have released it for native Hadoop and Storm, and we will release it for Spark next quarter. The agency with the largest Big Data has been our customer, and now we want to expand to other agencies through partnership.

Since you already understand the merit of our technology, is it possible that we meet face to face and discuss business possibilities? You are welcome to our office any time.

## The P-tree Algebra 1, 2

Source: PDF

Qin Ding, Maleq Khan, Amalendu Roy and William Perrizo

Computer Science Department, North Dakota State University

Fargo, ND 58105-5164, USA

qin.ding@ndsu.nodak.edu

1 Patents are pending on the bSQ and P-tree technology.

2 This work is partially supported by GSA Grant ACT# K96130308, NSF Grant OSR-9553368 and DARPA Grant DAAH04-96-1-0329. Permission to make digital or hard copies of all of part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC 2002, Madrid, Spain 2002 ACM 1-58113-445-2/02/03…$5.00

### ABSTRACT

The Peano Count Tree (P-tree) is a quadrant-based lossless tree representation of the original spatial data. The idea of the P-tree is to recursively divide the entire spatial data, such as Remotely Sensed Imagery data, into quadrants and record the count of 1-bits for each quadrant, thus forming a quadrant count tree. Using the P-tree structure, all the count information can be calculated quickly. This facilitates efficient ways for data mining. In this paper, we focus on the algebra and properties of the P-tree structure and its variations. We have implemented fast algorithms for P-tree generation and P-tree operations. Our performance analysis shows the P-tree has small space and time costs compared to the original data. We have also implemented some data mining algorithms using P-trees, such as Association Rule Mining, Decision Tree Classification and K-Clustering.

**Keywords**: Compression, Quadrant, Peano Ordering, Spatial Data, Tree Structure

### 1. INTRODUCTION

More and more spatial data have been collected in various ways. An important issue is how to efficiently store spatial data and derive useful information from them. Data mining can help to find interesting patterns or derive useful rules from spatial data. However, existing data mining algorithms do not scale well on large sized spatial data.

In this paper, we discuss a data structure, called the Peano Count Tree (or P-tree), and its algebra and properties. The P-tree structure is a lossless representation of the original spatial data. It can be viewed as a data-mining-ready structure, as it facilitates efficient ways for data mining [**7**].

One feature of spatial data is the neighborhood property. For example, in an image, neighboring pixels may have similar properties. The P-tree structure is based on this feature.

In this paper, we will focus on remotely sensed imagery data, including satellite images, aerial photography, and ground data. A remotely sensed image typically contains several attributes, called bands. For example, TM (Thematic Mapper) scenes contain at least seven bands (Blue, Green, Red, NIR, MIR, TIR and MIR2) while a TIFF image contains three bands (Blue, Green and Red). Each band contains a relative reflectance intensity value in the range 0-to-255 (one byte) for each pixel location. Ground data are collected at the surface of the earth and can also be organized into images. Yield is a typical example of ground data.

An image can be viewed as a relational table in which each pixel is a tuple and each band is an attribute. The primary key can be expressed as x-y coordinates or as latitude-longitude pairs. Remotely sensed imagery (RSI) data are collected in different ways and are organized in different formats. BSQ, BIL and BIP are three typical formats. The Band Sequential (BSQ) format is similar to the relational format. In BSQ format, each band is stored as a separate file, and each individual band uses the same raster order. TM scenes are in BSQ format. The Band Interleaved by Line (BIL) format stores the data in line-major order, i.e., the first row of all bands, followed by the second row of all bands, and so on. For example, SPOT data, which comes from French satellite sensors, is in BIL format. Band Interleaved by Pixel (BIP) is a pixel-major format. Standard TIFF images are in BIP format.

We propose a new format, called bit Sequential (bSQ), to organize spatial data [**7**]. A reflectance value in a band is a number in the range 0-255 and is represented as a byte. We split each band into eight separate files, one for each bit position. There are several reasons why we use the bSQ format. First, different bits make different contributions to the value. In some applications, the high-order bits alone provide the necessary information. Second, the bSQ format facilitates the representation of a precision hierarchy (from one bit up to eight bit precision). Third, bSQ format facilitates better compression.
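The split itself is straightforward; a small sketch with illustrative values (not taken from the paper's datasets):

```python
# Sketch of the bSQ split described above: each 0-255 band value becomes
# eight bit-planes, one per bit position (bit 7 = most significant).

band = [200, 201, 199, 60]   # illustrative reflectance values for 4 pixels

bit_planes = {
    bit: [(value >> bit) & 1 for value in band]
    for bit in range(7, -1, -1)
}

# Neighboring pixels tend to agree in the high-order planes, which is
# what makes those planes compress well.
print(bit_planes[7])  # [1, 1, 1, 0]
print(bit_planes[0])  # [0, 1, 1, 0]
```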

By using the bSQ format, neighboring pixels may have the same bit values in several high-order bits. This facilitates high compression for high-order bit files and brings us the idea of creating P-trees. P-trees are basically quadrant-wise, Peano-order, run-length-compressed representations of each bSQ file. There are several operations on P-trees, of which AND is the most important. The neighborhood feature of image data also makes P-tree operations fast. Operations on a group of pixels having the same bit value can be done together without considering each bit individually. Fast P-tree operations, especially the fast AND operation, provide the possibilities for efficient data mining.

In Figure 1, we give a very simple illustrative example with only two bands in a scene having only four pixels (two rows and two columns). Both decimal and binary reflectance values are given. We can see the difference of BSQ, BIL, BIP and bSQ formats.

The rest of the paper is organized as follows. Section 2 summarizes the basic ideas of Peano Count tree and its variation. Section 3 describes the algebra of P-tree. Section 4 discusses the properties of P-tree structure. Section 5 discusses the implementation issues and experimental results. Section 6 gives some related work. Conclusion is given in Section 7.

### 2. PEANO COUNT TREES (P-TREES)

#### 2.1 Basic P-trees

We reorganize each bit file of the bSQ format into a tree structure, called a Peano Count Tree (P-tree). The idea is to recursively divide the entire image into quadrants and record the count of 1-bits for each quadrant, thus forming a quadrant count tree [**7**]. P-trees are somewhat similar in construction to other data structures in the literature (e.g., Quadtrees [**3, 4, 5**] and HHcodes [**6**]).

For example, given an 8×8 bSQ file (one-bit-one-band file), its P-tree is as shown in Figure 2.

##### Figure 2. P-tree for a 8×8 bSQ file

In this example, 36 is the number of 1’s in the entire image, called the root count. This root level is labeled level 0. The numbers 16, 7, 13, and 0 at the next level (level 1) are the 1-bit counts for the four major quadrants in raster order. Since the first and last level-1 quadrants are composed entirely of 1-bits (called pure-1 quadrants) and 0-bits (called pure-0 quadrants) respectively, subtrees are not needed and these branches terminate. This pattern is continued recursively using the Peano or Z-ordering (recursive raster ordering) of the four sub-quadrants at each new level. Eventually, every branch terminates (since, at the “leaf” level, all quadrants are pure). If we were to expand all sub-trees, including those for pure quadrants, then the leaf sequence would be the Peano-ordering of the image. The Peano-ordering of the original image is called the Peano Sequence. Thus, we use the name Peano Count Tree for the tree structure above.
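The construction can be sketched as follows; this is my own reading of the definition above (toy node encoding, not the authors' implementation), shown on a small 4×4 grid:

```python
# Minimal sketch of Peano Count Tree construction: recursively split a
# 2^n x 2^n bit grid into quadrants, record the 1-bit count at each node,
# and terminate at pure-0 or pure-1 quadrants.

def build_ptree(grid, x=0, y=0, size=None):
    if size is None:
        size = len(grid)
    count = sum(grid[y + dy][x + dx] for dy in range(size) for dx in range(size))
    if count == 0 or count == size * size or size == 1:
        return {"count": count}                        # pure quadrant: terminate
    half = size // 2
    return {
        "count": count,
        "children": [                                  # raster (Peano) order
            build_ptree(grid, x, y, half),             # upper left
            build_ptree(grid, x + half, y, half),      # upper right
            build_ptree(grid, x, y + half, half),      # lower left
            build_ptree(grid, x + half, y + half, half),  # lower right
        ],
    }

# A 4x4 example: the root count is the number of 1s in the whole grid.
grid = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
]
tree = build_ptree(grid)
print(tree["count"], [c["count"] for c in tree["children"]])  # 9 [4, 0, 1, 4]
```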

The fan-out of a P-tree need not be fixed at four. It can be any power of 4 (effectively skipping levels in the tree). Also, the fan-out at any one level need not coincide with the fan-out at another level. The fan-out pattern can be chosen to produce maximum compression for each bSQ file. We use P-Tree-r-i-l to indicate the fan-out pattern, where r is the fan-out of the root node, i is the fan-out of all internal nodes at levels 1 to L-1 (where the root has level L and leaves have level 0), and l is the fan-out of all nodes at level 1. We have implemented P-Tree-4-4-4, P-Tree-4-4-16, and P-Tree-4-4-64.

**Definition 1**: A basic P-tree P_{i,j} is the P-tree for the jth bit of the ith band. The complement of the basic P-tree P_{i,j} is denoted P_{i,j}' (the complement operation is explained below).

For each band (assuming 8-bit data values, though the model applies to data of any number of bits), there are eight basic P-trees, one for each bit position. We will call these P-trees the basic P-trees of the spatial dataset. We will use the notation P_{b,i} to denote the basic P-tree for band b and bit position i. There are always 8n basic P-trees for a dataset with n bands.

P-trees have the following features:

- P-trees contain 1-count for every quadrant of every dimension.
- The P-tree for any sub-quadrant at any level is simply the sub-tree rooted at that sub-quadrant.
- A P-tree leaf sequence (depth-first) is a partial run-length-compressed version of the original bit-band.
- Basic P-trees can be combined to reproduce the original data (P-trees are lossless representations).
- P-trees can be partially combined to produce upper and lower bounds on all quadrant counts.
- P-trees can be used to smooth data by bottom-up quadrant purification (bottom-up replacement of mixed counts with their closest pure counts).

P-trees can be generated quite quickly and can be viewed as a “data mining ready” and lossless format for storing spatial data.

#### 2.2 P-tree variations

A variation of the P-tree data structure, the Peano Mask Tree (PM-tree, or PMT), is a similar structure in which masks rather than counts are used. In a PM-tree, we use a 3-value logic to represent pure-1, pure-0, and mixed quadrants (1 denotes pure-1, 0 denotes pure-0, and m denotes mixed). The PM-tree for the previous example is given in Figure 3. Since a PM-tree is just an alternative implementation of a Peano Count tree (PC-tree, or PCT), we will use the term “P-tree” to cover both the Peano Count tree (PCT) and the Peano Mask tree (PMT).

##### Figure 3. PM-tree

We can also use some other variations, such as the P1-tree and P0-tree. In the P1-tree, we use 1 to indicate a pure-1 quadrant and 0 to indicate others. In the P0-tree, we use 1 to indicate a pure-0 quadrant and 0 to indicate others. Both the P1-tree and the P0-tree are lossless representations of the original data.

The P1-tree and P0-tree of the previous example are given in Figure 4.

##### Figure 4. P1-tree and P0-tree

We can also use the non-pure-0-tree (NP0-tree) and non-pure-1-tree (NP1-tree). In the non-pure-0-tree, we use 1 to indicate a non-pure-0 quadrant (covering both pure-1 and mixed quadrants) and 0 to indicate a pure-0 quadrant. The NP1-tree can be defined in a similar way.

### 3. P-TREE ALGEBRA

P-tree algebra includes three basic operations: complement, AND, and OR. Each basic P-tree has a natural complement. The complement of a basic P-tree can be constructed directly from the P-tree by simply complementing the counts at each level (subtracting from the pure-1 count at that level), as shown in the example below (Figure 5). Note that the complement of a P-tree provides the 0-bit counts for each quadrant. P-tree AND/OR operations are also illustrated in Figure 5.
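The count-complement rule can be sketched as follows, using the paper's 8×8 example (root count 36; level-1 counts 16, 7, 13, 0) and a toy dictionary encoding of my own:

```python
# Sketch of complementing a Peano Count Tree: each node's 1-count is
# subtracted from the pure-1 count for a quadrant of its size, so the
# complement records the 0-bit counts.

def complement(node, size):
    pure1 = size * size                      # 1-count of a pure-1 quadrant
    result = {"count": pure1 - node["count"]}
    if "children" in node:
        result["children"] = [complement(c, size // 2) for c in node["children"]]
    return result

# The paper's 8x8 example, truncated to the first two levels.
tree = {"count": 36, "children": [{"count": 16}, {"count": 7},
                                  {"count": 13}, {"count": 0}]}
comp = complement(tree, 8)
print(comp["count"], [c["count"] for c in comp["children"]])  # 28 [0, 9, 3, 16]
```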

#### Figure 5. P-tree Algebra (Complement, AND and OR)

Among the three operations, AND is the most important. OR can be implemented in a very similar way to AND. Below we discuss various options for implementing P-tree ANDing.

#### 3.1 Levelwise P-tree ANDing

ANDing is a very important and frequently used operation for P-trees. There are several ways to perform P-tree ANDing. First let’s look at a simple way: we can perform ANDing level-by-level starting from the root level. Table 1 gives the rules for performing P-tree ANDing. Operand 1 and Operand 2 are two P-trees (or subtrees) with roots X_{1} and X_{2} respectively. Using PM-trees, X_{1} and X_{2} could be any value among 1, 0, and m (3-value logic representing pure-1, pure-0, and mixed quadrants). For example, ANDing a pure-1 P-tree with any P-tree results in the second operand; ANDing a pure-0 P-tree with any P-tree results in the pure-0 P-tree. Note that it is possible that ANDing two m’s results in a pure-0 quadrant if their four sub-quadrants result in pure-0 quadrants.

##### Table 1. P-tree ANDing rules

| Operand 1 | Operand 2 | Result |
|---|---|---|
| 1 | X_{2} | Subtree with root X_{2} |
| 0 | X_{2} | 0 |
| X_{1} | 1 | Subtree with root X_{1} |
| X_{1} | 0 | 0 |
| m | m | 0 if all four sub-quadrants result in 0; otherwise m |

#### 3.2 P-tree ANDing using Pure-1 paths

There is another, more efficient way to do P-tree ANDing. The approach is to store only the basic P-trees and then generate the value and tuple P-tree root counts "on the fly" as needed. In this algorithm, we assume P-trees are coded in a compact, depth-first ordering of the paths to each pure-1 quadrant. We use a hierarchical quadrant id (Qid) scheme (Figure 6) to identify quadrants: at each level, we append a sub-quadrant id number (0 means upper left, 1 means upper right, 2 means lower left, 3 means lower right).

##### Figure 6. Quadrant id (Qid)

For a spatial data set with 2^{n} rows and 2^{n} columns, there is a mapping from raster coordinates (x, y) to Peano coordinates (called the quadrant id, or Qid). If x and y are expressed as n-bit strings, x_{1}x_{2}…x_{n} and y_{1}y_{2}…y_{n}, then the mapping is (x, y) = (x_{1}x_{2}…x_{n}, y_{1}y_{2}…y_{n}) → (x_{1}y_{1} . x_{2}y_{2} . … . x_{n}y_{n}). Thus, in an 8 by 8 image, the pixel at (3, 6) = (011, 110) has quadrant id 01.11.10 = 1.3.2. For simplicity, the Qid is written as 132 instead of 1.3.2 in Figure 6.
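The mapping above is plain bit interleaving, which can be sketched as follows (a minimal illustration; the function name is ours):

```python
def qid(x, y, n):
    """Map raster coordinates (x, y) in a 2^n x 2^n image to Qid digits."""
    digits = []
    for j in range(n - 1, -1, -1):     # most significant bit first
        xb = (x >> j) & 1              # bit of x at this level
        yb = (y >> j) & 1              # bit of y at this level
        digits.append(2 * xb + yb)     # read the pair x_k y_k as one base-4 digit
    return digits

qid(3, 6, 3)  # [1, 3, 2], i.e. Qid 1.3.2 for the pixel at (3, 6)
```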

An example is given in Figure 7. Each path is represented by the sequence of quadrants in Peano order, beginning just below the root. Since a quadrant will be pure-1 in the result only if it is pure-1 in both/all operands, the AND is done as follows: scan the operands; output matching pure-1 paths.

The AND operation is effectively the pixel-wise AND of bits from bSQ files or their complement files. However, since such files can contain hundreds of millions of bits, shortcut methods are needed. Implementations of these methods allow an n-way AND of TIFF-image P-trees (1320 by 1320 pixels) to be performed in about 20 milliseconds. We discuss such methods later in the paper. The process of converting data to P-trees is also time consuming unless special methods are used. For example, our methods can convert even a large TM satellite image (approximately 60 million pixels) to its basic P-trees in just a few seconds on a high-performance PC, and this is a one-time process.
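A minimal sketch of the path-matching step, assuming each operand is represented simply as its list of pure-1 paths (tuples of Qid digits): a quadrant is pure-1 in the result when its path in one operand is covered by a pure-1 quadrant (a prefix path, or an identical path) in the other.

```python
def and_pure1_paths(paths1, paths2):
    """Output the pure-1 paths of the AND of two path-coded P-trees."""
    result = []
    for p in paths1:
        for q in paths2:
            k = min(len(p), len(q))
            if p[:k] == q[:k]:                     # one path is a prefix of the other
                result.append(max(p, q, key=len))  # keep the deeper quadrant
    return result

# Operand 1 is pure-1 over quadrant 1 and sub-quadrant 2.3;
# operand 2 is pure-1 over sub-quadrants 1.0 and 2.3:
and_pure1_paths([(1,), (2, 3)], [(1, 0), (2, 3)])  # [(1, 0), (2, 3)]
```

A production implementation would scan both depth-first-ordered path lists in a single merging pass rather than comparing all pairs.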

##### Figure 7. P-tree ANDing using pure-1 path

#### 3.3 Value and Tuple P-trees

By performing the AND operation on the appropriate subset of the basic P-trees and their complements, we can construct P-trees for values with more than one bit.

**Definition 2**: A value P-tree P_{i} (v), is the P-tree of value v at band i. Value v can be expressed in 1-bit up to 8-bit precision.

Value P-trees can be constructed by ANDing basic P-trees or their complements. For example, the value P-tree P_{i} (110) gives the count of pixels with band-i bit 1 equal to 1, bit 2 equal to 1 and bit 3 equal to 0, i.e., with band-i value in the range [192, 224). It can be constructed from the basic P-trees as:

P_{i} (110) = P_{i,1} AND P_{i,2} AND P_{i,3}’
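This construction can be sketched with bit vectors standing in for P-trees (a simplification that ignores the tree compression; `mask` is an all-1s vector over the N pixels, and the function name is ours):

```python
def value_ptree(bit_planes, value_bits, mask):
    """AND basic P-trees, or their complements where the value bit is 0."""
    result = mask
    for plane, bit in zip(bit_planes, value_bits):
        result &= plane if bit else (~plane & mask)  # P_{i,j} or P_{i,j}'
    return result

# Four pixels; bit planes for band-i bits 1, 2 and 3 (leftmost bit first):
p1, p2, p3 = 0b1011, 0b1110, 0b0100
value_ptree([p1, p2, p3], [1, 1, 0], 0b1111)  # == 0b1010: two pixels match pattern 110
```

The root count of the value P-tree is then just the number of 1 bits in the result.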

P-trees can also represent data for any value combination from any band, even an entire tuple. In the same way, we can construct tuple P-trees.

**Definition 3**: A tuple P-tree P (v_{1}, v_{2}, …, v_{n}), is the P-tree of value v_{i} at band i, for all i from 1 to n. We have P (v_{1}, v_{2}, …, v_{n}) = P_{1}(v_{1}) AND P_{2}(v_{2}) AND … AND P_{n}(v_{n})

If value v_{j} is not given, it means it can be any value in band j. For example, P (110, , 101, 001, , , , ) stands for a tuple P-tree with value 110 in band 1, 101 in band 3, 001 in band 4 and any value in any other band.

**Definition 4**: An interval P-tree P_{i} (v_{1}, v_{2}), is the P-tree for values in the interval [v_{1}, v_{2}] of band i. We have

P_{i} (v_{1}, v_{2}) = OR P_{i} (v), for all v in [v_{1}, v_{2}].

Any value P-tree or tuple P-tree can be constructed by ANDing basic P-trees and their complements; interval P-trees can be constructed by combining AND and OR operations on basic P-trees. All the P-tree operations, including AND, OR, COMPLEMENT and XOR, can be performed on any of the kinds of P-trees defined above. The process of ANDing basic P-trees and their complements to produce value P-trees or tuple P-trees can be done at any level of precision, from 1-bit up to 8-bit.
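Definition 4 can be illustrated the same way, ORing value P-trees over the interval (here `vpt` is a hypothetical map from each band-i value to its value P-tree, again encoded as a bit vector):

```python
def interval_ptree(vpt, v1, v2):
    """OR the value P-trees P_i(v) for all v in [v1, v2]."""
    result = 0
    for v in range(v1, v2 + 1):
        result |= vpt.get(v, 0)    # values absent from the data contribute nothing
    return result

# Four pixels with band-i values 2, 5, 3 and 7 (one pixel per bit position):
vpt = {2: 0b0001, 5: 0b0010, 3: 0b0100, 7: 0b1000}
interval_ptree(vpt, 2, 5)  # == 0b0111: the three pixels with values in [2, 5]
```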

### 4. PROPERTIES OF P-TREES

In this section, we discuss several useful properties of P-trees, using the following notation:

P_{x,y} is the pixel with coordinates (x, y); V_{x,y,i} is the value of band i of pixel P_{x,y}; b_{x,y,i,j} is the jth bit of V_{x,y,i} (bits are numbered from left to right; b_{x,y,i,0} is the leftmost bit). Indices: x: column (x-coordinate), y: row (y-coordinate), i: band, j: bit.

For any P-trees P, P_{1} and P_{2}: P_{1} & P_{2} denotes P_{1} AND P_{2}, P_{1} | P_{2} denotes P_{1} OR P_{2}, P_{1} ⊕ P_{2} denotes P_{1} XOR P_{2}, and P′ denotes the COMPLEMENT of P.

P_{i,j} is the basic P-tree for bit j of band i; P_{i}(v) is the value P-tree for value v of band i; P_{i}(v_{1}, v_{2}) is the interval P-tree for the interval [v_{1}, v_{2}] of band i; r_{c}(P) is the root count of P-tree P. P_{0} is the pure-0 tree and P_{1} is the pure-1 tree. N is the number of pixels in the image or space under consideration.

Lemma 1: For any two P-trees P_{1} and P_{2}, r_{c}(P_{1} | P_{2}) = 0 ⇒ r_{c}(P_{1}) = 0 and r_{c}(P_{2}) = 0. More strictly, r_{c}(P_{1} | P_{2}) = 0, if and only if r_{c}(P_{1}) = 0 and r_{c}(P_{2}) = 0.

Proof (by contradiction): Suppose r_{c}(P_{1}) ≠ 0. Then for some pixels there are 1s in P_{1}, and for those pixels there must be 1s in P_{1} | P_{2}, i.e., r_{c}(P_{1} | P_{2}) ≠ 0. But we assumed r_{c}(P_{1} | P_{2}) = 0, a contradiction; therefore r_{c}(P_{1}) = 0. Similarly, we can prove that r_{c}(P_{2}) = 0.

The proof of the converse, r_{c}(P_{1}) = 0 and r_{c}(P_{2}) = 0 ⇒ r_{c}(P_{1} | P_{2}) = 0, is trivial; it follows immediately from the definitions.

Lemma 2:

- a) r_{c}(P_{1}) = 0 or r_{c}(P_{2}) = 0 ⇒ r_{c}(P_{1} & P_{2}) = 0
- b) r_{c}(P_{1}) = 0 and r_{c}(P_{2}) = 0 ⇒ r_{c}(P_{1} & P_{2}) = 0
- c) r_{c}(P_{0}) = 0
- d) r_{c}(P_{1}) = N
- e) P & P_{0} = P_{0}
- f) P & P_{1} = P
- g) P | P_{0} = P
- h) P | P_{1} = P_{1}
- i) P & P′ = P_{0}
- j) P | P′ = P_{1}

Proofs are immediate.

Lemma 3: v_{1} ≠ v_{2} ⇒ r_{c}{P_{i} (v_{1}) & P_{i}(v_{2})} = 0, for any band i.

Proof: P_{i} (v) represents all the pixels having value v for band i. If v_{1} ≠ v_{2}, no pixel can have both values v_{1} and v_{2} for the same band. Therefore, if there is a 1 in P_{i} (v_{1}) for a pixel, there must be a 0 in P_{i}(v_{2}) for that pixel, and vice versa. Hence r_{c}{P_{i} (v_{1}) & P_{i}(v_{2})} = 0.

Lemma 4: r_{c}(P_{1} | P_{2}) = r_{c}(P_{1}) + r_{c}(P_{2}) - r_{c}(P_{1} & P_{2}).

Proof: Let n_{1} be the number of pixels for which there are 1s in P_{1} and 0s in P_{2}, n_{2} the number of pixels for which there are 0s in P_{1} and 1s in P_{2}, and n_{3} the number of pixels for which there are 1s in both P_{1} and P_{2}.

Now, r_{c}(P_{1}) = n_{1} + n_{3}, r_{c}(P_{2}) = n_{2} + n_{3}, r_{c}(P_{1} & P_{2}) = n_{3} and r_{c}(P_{1} | P_{2}) = n_{1} + n_{2} + n_{3} = (n_{1} + n_{3}) + (n_{2} + n_{3}) - n_{3} = r_{c}(P_{1}) + r_{c}(P_{2}) - r_{c}(P_{1} & P_{2}).

Theorem: r_{c}{P_{i} (v_{1}) |P_{i} (v_{2})} = r_{c}{P_{i} (v_{1})} + r_{c}{P_{i}(v_{2})}, where v_{1} ≠ v_{2}.

Proof: r_{c}{P_{i} (v_{1}) | P_{i}(v_{2})} = r_{c}{P_{i} (v_{1})} + r_{c}{P_{i}(v_{2})} - r_{c}{P_{i} (v_{1}) & P_{i}(v_{2})} (Lemma 4)

If v_{1} ≠ v_{2}, r_{c}{P_{i} (v_{1}) & P_{i}(v_{2})} = 0. (Lemma 3)

Therefore, r_{c}{P_{i} (v_{1}) | P_{i}(v_{2})} = r_{c}{P_{i} (v_{1})} + r_{c}{P_{i}(v_{2})}.
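Lemma 4 and the theorem are easy to check numerically, again with bit vectors standing in for P-trees and a popcount for r_{c}:

```python
def rc(p):
    """Root count: number of 1 bits in the bit-vector stand-in."""
    return bin(p).count('1')

p1, p2 = 0b110101, 0b011100
assert rc(p1 | p2) == rc(p1) + rc(p2) - rc(p1 & p2)   # Lemma 4: 5 == 4 + 3 - 2
```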

### 5. IMPLEMENTATION ISSUES AND EXPERIMENTAL RESULTS

#### 5.1 P-tree Header

To make a generalized P-tree structure, the following header for a P-tree file is proposed in Table 2.

##### Table 2. P-tree header

| 1 byte | 2 bytes | 1 byte | 4 bytes | 2 bytes | |
| --- | --- | --- | --- | --- | --- |
| Format code | Fan-out | # of levels | Root count | Length of the body | Body of the P-tree |

*Format code*: Format code identifies the format of the P-tree, whether it is a PCT or PMT or in any other format.

*Fan-out*: This field contains the fan-out information of the P-tree. The fan-out is required to traverse the P-tree when performing various P-tree operations.

*# of levels*: The number of levels in the P-tree. When we encounter a pure-1 or pure-0 node, we cannot tell whether it is an interior node or a leaf unless we know the level of that node and the total number of levels of the tree. This is also required to know the number of 1s represented by a pure-1 node.

*Root count*: The root count, i.e. the number of 1s in the P-tree. Although the root count can be calculated on the fly from the P-tree data, these 4 bytes of space save computation time when only the root count is needed. The root count can be computed during construction of the P-tree at very little extra cost.

*Length of the body*: The size of the P-tree file in bytes, excluding the header. Sometimes we may want to keep the whole P-tree in RAM to increase the efficiency of computation. Since the size of a P-tree varies, we need to allocate memory dynamically, which requires knowing the required memory size before reading the data from disk.
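Under the field sizes of Table 2, the header occupies 10 bytes. A sketch of packing and unpacking it (assuming little-endian byte order, which the paper does not specify; function names are ours):

```python
import struct

# 1-byte format code, 2-byte fan-out, 1-byte # of levels,
# 4-byte root count, 2-byte body length
HEADER = struct.Struct('<BHBIH')

def pack_ptree(fmt_code, fanout, levels, root_count, body):
    """Serialize a P-tree: header fields followed by the body bytes."""
    return HEADER.pack(fmt_code, fanout, levels, root_count, len(body)) + body

def unpack_header(data):
    """Read the header and slice out exactly body-length bytes."""
    fmt_code, fanout, levels, root_count, body_len = HEADER.unpack_from(data)
    body = data[HEADER.size:HEADER.size + body_len]
    return fmt_code, fanout, levels, root_count, body

raw = pack_ptree(1, 4, 6, 1200, b'\x00' * 16)
unpack_header(raw)[3]   # 1200: the root count is read without scanning the body
```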

#### 5.2 Performance Analysis

We only store the basic P-trees for each dataset. All other P-trees (value P-trees and tuple P-trees) are created on the fly as needed, which results in a considerable saving of space. Figures 8, 9 and 10 give the storage needs for various kinds of data (TIFF, SPOT and TM scenes) using various formats of P-trees (PCT or PMT) with different fan-out patterns.

The AND operation on basic P-trees is fast. Figures 11 and 12 show the time required to perform P-tree ANDing. The ANDing time varies from 6.72 ms to 52.12 ms for two TM scenes of size 2048×2048.

##### Figure 8. Comparison of file size for different bits of Band 1 & 2 of a TIFF image

##### Figure 9. Comparison of file size for different bits of Band 3 & 4 of a SPOT image

##### Figure 10. Comparison of file size for different bits of Band 5 & 6 of a TM image

##### Figure 11. Comparison of time required to perform ANDing operation

##### Figure 12. Average time required to perform ANDing operation

### 6. RELATED WORK

Concepts related to the P-tree data structure include quadtrees [**1, 2, 3, 4, 5**] and their variants (such as point quadtrees [**3**] and region quadtrees [**4**]), and HH-codes [**6**]. Quadtrees decompose the universe by means of iso-oriented hyperplanes. These partitions do not have to be of equal size, although that is often the case. The decomposition into subspaces is usually continued until the number of objects in each partition is below a given threshold.

HH-codes, or Helical Hyperspatial Codes, are binary representations of the Riemannian diagonal. The binary division of the diagonal forms the node point from which eight sub-cubes are formed. Each sub-cube has its own diagonal, generating new sub-cubes. These cubes are formed by interlacing one-dimensional values encoded as HH bit codes. When sorted, they cluster in groups along the diagonal. The clusters are ordered in a helical pattern, hence the name "Helical Hyperspatial".

The similarity among P-trees, quadtrees and HH-codes is that they are all quadrant based. The difference is that P-trees focus on counts. P-trees are not indexes; rather, they are representations of the datasets themselves. P-trees are particularly useful for data mining because they contain the aggregate information needed for data mining.

### 7. CONCLUSION

In this paper, we describe the properties and algebra of a lossless data structure, the Peano Count Tree (P-tree). The P-tree structure has features that make it suitable for efficient data mining, such as association rule mining, classification and clustering. The details of how to use P-trees in data mining are beyond the scope of this paper. The basic idea is that data mining requires a great many counts, and with the information in P-trees these counts can be collected quickly by ANDing the appropriate P-trees instead of scanning the entire database. The idea of the P-tree originally came from spatial data; however, it can be extended to represent other kinds of data, such as DNA microarray data and VLSI data.

### 8. REFERENCES

#### [1]

Volker Gaede and Oliver Gunther, “Multidimensional Access Methods”, Computing Surveys, 30(2), 1998.

#### [2]

H. Samet, “The quadtree and related hierarchical data structures”, ACM Computing Surveys, 16, 2, 1984.

#### [3]

H. Samet, “Applications of Spatial Data Structures”, Addison-Wesley, Reading, Mass., 1990.

#### [4]

H. Samet, “The Design and Analysis of Spatial Data Structures”, Addison-Wesley, Reading, Mass., 1990.

#### [5]

R. A. Finkel and J. L. Bentley, “Quad trees: A data structure for retrieval of composite keys”, Acta Informatica, 4, 1, 1974.

#### [6]

HH-codes. Available at http://www.statkart.no/nlhdb/iveher/hhtext.html

#### [7]

William Perrizo, Qin Ding, Qiang Ding and Amalendu Roy, “Deriving High Confidence Rules from Spatial Data using Peano Count Trees”, Springer-Verlag, LNCS 2118, July 2001.
