PPIC: A Tool for Sequential Pattern Mining Using CP

PPIC is a flexible, open-source framework for sequential pattern mining. It takes a constraint programming (CP) approach to mining, based on the PrefixSpan method. It handles several user-defined constraints, such as size constraints, item constraints, and regular expression constraints. Time constraints are supported by the PPICt framework, available at http://sites.uclouvain.be/cp4dm/spm/ppict/.

PPIC is implemented in the OscaR solver.

Usage

PPIC takes a sequence database file (dataset.sdb) as input. Scripts to convert a dataset into this format are available in scripts.tar.gz. Some datasets are available in dataset.tar.gz.

    $java -jar oscar.spm-SNAPSHOT.jar [options] <SDB File> <Lmin> <Lmax>

<SDB File>
    the input sequential database
<Lmin>
    the minimum size of patterns (minimum value 1)
<Lmax>
    the maximum size of patterns (value 0 representing the longest sequence in SDB)

[options]
-f minfreq | --minfreq minfreq
    the absolute frequency threshold (default value: 1)
-s minsup | --minsup minsup
    the relative frequency threshold (default 0.25 = 50%)
-i item1=number1,item2=number2,... | --items item1=number1,item2=number2,...
    item constraint specifying which items must or must not appear in a pattern; give each item with its number of occurrences, separated by ',' (number 0 forbids the item)
-e <6 or 10 or 14> | --RE <6 or 10 or 14>
    regular expression, specified by a number equal to 6, 10 or 14
-a <0 or 1 or 2> | --algo <0 or 1 or 2>
    the algorithm to use: PPIC=0 (default), PPDC=1, or PPmixed=2
-o <file> | --out <file>
    output the solutions to <file>
-v | --verbose
    output all results with full details
-pp | --prefix-pattern
    output prefix-closed sequential patterns
-y <0 or 1 or 2 or 3> | --fix <0 or 1 or 2 or 3>
    the fix to use: PPIC with all improvements=0 (default), PPIC+fix(4+1+3)=1, PPIC+fix(4+1)=2, or PPIC+fix(4)=3
--help
    show how to use this application

Note: regular expressions for data200k.txt
    RE6  = A*B(B|C)D*E
    RE10 = A*B(B|C)D*EF*(G|H)I*
    RE14 = A*(Q|BS*(B|C))D*E(I|S)*(F|H)G*R
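
To make the three regular expressions in the note concrete, the sketch below checks candidate patterns against them with Python's `re` module. It assumes that every item is a single letter and that a pattern must match the expression as a whole; both are assumptions made for illustration, and the helper `satisfies` is hypothetical, not part of PPIC.

```python
import re

# Regular expressions copied from the note on data200k.txt.
RE6 = re.compile(r"A*B(B|C)D*E")
RE10 = re.compile(r"A*B(B|C)D*EF*(G|H)I*")
RE14 = re.compile(r"A*(Q|BS*(B|C))D*E(I|S)*(F|H)G*R")


def satisfies(pattern_items, regex):
    """Return True if the concatenated items match the whole expression.

    Assumption: each item is one letter, so joining them gives a string
    that can be matched directly.
    """
    return regex.fullmatch("".join(pattern_items)) is not None


print(satisfies(["A", "A", "B", "C", "D", "E"], RE6))  # True
print(satisfies(["B", "B", "E"], RE6))                 # True
print(satisfies(["A", "B", "D", "E"], RE6))            # False: (B|C) is missing
```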

Examples

The input data for these examples is the dataset test.txt. Each line is a sequence, and the items in a sequence are separated by spaces.

 test.txt = 
1 2 3 2 3
2 1 2 3
1 2
2 3 4
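
As an illustration of this input format, here is a minimal Python sketch that reads such a file's contents into a list of sequences. The function name `parse_sdb` is hypothetical, and items are kept as strings because the format itself does not say whether they must be numeric.

```python
def parse_sdb(text):
    """Parse a sequence database: one sequence per line, items split on spaces."""
    return [line.split() for line in text.splitlines() if line.strip()]


# Contents of test.txt as listed above.
TEST_TXT = """\
1 2 3 2 3
2 1 2 3
1 2
2 3 4
"""

sdb = parse_sdb(TEST_TXT)
print(len(sdb))  # 4
print(sdb[0])    # ['1', '2', '3', '2', '3']
```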
                

Extracting sequential patterns

Given a minimum support threshold (minsup = 2 or 50%), find all patterns with a minimum length of 1 (Lmin=1).

    $java -jar oscar.ppic.1.0.0.jar test.txt 1 0 -f 2
or
    $java -jar oscar.ppic.1.0.0.jar test.txt 1 0 -s 0.5
or
    $java -jar oscar.ppic.1.0.0.jar test.txt 1 0 --minsup 0.5

[output]
    < 1 > : 3
    < 1 2 > : 3
    < 1 2 3 > : 2
    < 1 3 > : 2
    < 2 > : 4
    < 2 2 > : 2
    < 2 2 3 > : 2
    < 2 3 > : 3
    < 3 > : 3
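
To see where these supports come from, the following is a small prefix-projection miner in plain Python, in the spirit of PrefixSpan. It is only a sketch for checking the example by hand and is not PPIC's constraint programming model. The names `mine` and `to_minfreq` are hypothetical, and rounding the relative threshold up to an absolute count is an assumption.

```python
import math

# test.txt from the examples, one list per sequence.
SDB = [
    [1, 2, 3, 2, 3],
    [2, 1, 2, 3],
    [1, 2],
    [2, 3, 4],
]


def mine(sdb, minfreq, lmin=1, lmax=0):
    """Return {pattern: support} for patterns with support >= minfreq."""
    if lmax == 0:  # 0 stands for the longest sequence in the database
        lmax = max(len(seq) for seq in sdb)
    results = {}

    def extend(prefix, projected):
        # projected holds (sequence index, position just after the
        # earliest match of prefix in that sequence).
        if len(prefix) >= lmax:
            return
        # Count each item at most once per projected suffix.
        support = {}
        for sid, start in projected:
            for item in set(sdb[sid][start:]):
                support[item] = support.get(item, 0) + 1
        for item in sorted(support):
            if support[item] < minfreq:
                continue
            pattern = prefix + (item,)
            if len(pattern) >= lmin:
                results[pattern] = support[item]
            # Project each sequence on the first occurrence of item.
            new_projected = [
                (sid, sdb[sid].index(item, start) + 1)
                for sid, start in projected
                if item in sdb[sid][start:]
            ]
            extend(pattern, new_projected)

    extend((), [(sid, 0) for sid in range(len(sdb))])
    return results


def to_minfreq(minsup, n_sequences):
    """Convert a relative threshold to a count (rounding up is an assumption)."""
    return math.ceil(minsup * n_sequences)


patterns = mine(SDB, minfreq=to_minfreq(0.5, len(SDB)))
for pattern, support in patterns.items():
    print("<", *pattern, ">", ":", support)
```

Run on the four sequences of test.txt, the sketch finds the same nine patterns and supports as listed in the output above.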

Extracting sequential patterns under size constraint

Given a minimum support threshold (minsup = 2 or 50%), find all patterns with a minimum length of 3 (Lmin=3) and a maximum length of 3 (Lmax=3).

    $java -jar oscar.ppic.1.0.0.jar test.txt 3 3 -f 2

[output]
    < 1 2 3 > : 2
    < 2 2 3 > : 2
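
The size constraint can be read as a filter on pattern length. The sketch below applies it to the frequent patterns of the first example; the dictionary is copied from that output, and `within_size` is a hypothetical helper. A solver would normally enforce the bound during search rather than filter afterwards, so this only illustrates the meaning of Lmin and Lmax.

```python
# Frequent patterns of test.txt at minfreq 2, copied from the first example.
FREQUENT = {
    (1,): 3, (1, 2): 3, (1, 2, 3): 2, (1, 3): 2,
    (2,): 4, (2, 2): 2, (2, 2, 3): 2, (2, 3): 3, (3,): 3,
}


def within_size(pattern, lmin, lmax):
    """Size constraint: lmin <= number of items in the pattern <= lmax."""
    return lmin <= len(pattern) <= lmax


selected = {p: s for p, s in FREQUENT.items() if within_size(p, 3, 3)}
print(selected)  # {(1, 2, 3): 2, (2, 2, 3): 2}
```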

Extracting sequential patterns under size constraint and item constraint

Given a minimum support threshold (minsup = 2 or 50%), find all patterns with a minimum length of 2 (Lmin=2) and a maximum length of 3 (Lmax=3) that contain item 3 once.

    $java -jar oscar.ppic.1.0.0.jar test.txt 2 3 -f 2 -i 3=1

[output]
    < 1 2 3 > : 2
    < 1 3 > : 2
    < 2 2 3 > : 2
    < 2 3 > : 3
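
The sketch below shows one way to read the item specification `3=1`: parse it into a mapping from item to required count, then keep the patterns in which each listed item occurs exactly that many times. Treating the number as an exact count is an assumption (the example cannot distinguish it from "at least once"), and `parse_items` and `satisfies_items` are hypothetical helpers.

```python
# Frequent patterns of test.txt at minfreq 2, copied from the first example.
FREQUENT = {
    (1,): 3, (1, 2): 3, (1, 2, 3): 2, (1, 3): 2,
    (2,): 4, (2, 2): 2, (2, 2, 3): 2, (2, 3): 3, (3,): 3,
}


def parse_items(spec):
    """Parse 'item1=number1,item2=number2,...' into {item: count}."""
    pairs = (pair.split("=") for pair in spec.split(","))
    return {int(item): int(count) for item, count in pairs}


def satisfies_items(pattern, required):
    """Assumed semantics: each listed item occurs exactly `count` times."""
    return all(pattern.count(item) == count for item, count in required.items())


required = parse_items("3=1")
selected = {
    p: s for p, s in FREQUENT.items()
    if 2 <= len(p) <= 3 and satisfies_items(p, required)
}
for pattern, support in selected.items():
    print("<", *pattern, ">", ":", support)
```

With a count of 0, the same check rejects every pattern that contains the item, which matches the "number 0 to forbid item" rule in the usage text.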

Extracting prefix-closed sequential patterns

Given a minimum support threshold (minsup = 2 or 50%), find all prefix-closed patterns.

    $java -jar oscar.ppic.1.0.0.jar test.txt 1 0 -f 2 -pp

[output]
    < 1 2 3 > : 2
    < 1 3 > : 2
    < 2 2 3 > : 2
    < 2 3 > : 3
    < 3 > : 3
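
This document does not define "prefix-closed". One reading that reproduces the output above on test.txt is that a pattern is reported only when no frequent pattern extends it by one item at the end. The sketch below tests that reading; it is an assumption inferred from this single example, not PPIC's stated definition, and `no_frequent_extension` is a hypothetical helper.

```python
# Frequent patterns of test.txt at minfreq 2, copied from the first example.
FREQUENT = {
    (1,): 3, (1, 2): 3, (1, 2, 3): 2, (1, 3): 2,
    (2,): 4, (2, 2): 2, (2, 2, 3): 2, (2, 3): 3, (3,): 3,
}


def no_frequent_extension(pattern, frequent):
    """True if no frequent pattern equals `pattern` plus one trailing item."""
    return not any(
        len(other) == len(pattern) + 1 and other[:-1] == pattern
        for other in frequent
    )


# Assumed reading of -pp: keep patterns that cannot be extended at the end.
reported = {
    p: s for p, s in FREQUENT.items() if no_frequent_extension(p, FREQUENT)
}
for pattern, support in reported.items():
    print("<", *pattern, ">", ":", support)
```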