PPIC: Tool for Sequential Pattern Mining using CP
PPIC is a flexible, open-source framework for sequential pattern mining. It uses a constraint programming (CP) approach to mine data, based on a specialized PrefixSpan method. It handles several user-defined constraints, such as size, item, and regular expression constraints. Time constraints are available in the PPICt framework, available at http://sites.uclouvain.be/cp4dm/spm/ppict/.
PPIC is implemented in the OscaR solver.
- > Download
- User version : oscar.ppic.1.0.0.jar, Binary pack
- Developer link : https://bitbucket.org/pschaus/cp4d
- Datasets : dataset.tar.gz, Protein.txt, data200k.txt
- Useful scripts : scripts.tar.gz
- > Relevant publications
- An Efficient Algorithm for Mining Frequent Sequence with Constraint Programming, Aoga, John O. R., Guns, Tias, and Schaus, Pierre, Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part II, Cham, p. 315–330 (2016). [PDF][Slides]
Usage
The dataset format that PPIC accepts is dataset.sdb. A script to convert a dataset to this format is available in scripts.tar.gz. Some datasets are available in dataset.tar.gz.
$java -jar oscar.spm-SNAPSHOT.jar [options] <SDB File> <Lmin> <Lmax>
  <SDB File>  the input sequential database
  <Lmin>  the minimum size of patterns (minimum value 1)
  <Lmax>  the maximum size of patterns (value 0 represents the longest sequence in the SDB)
[options]
  -f minfreq | --minfreq minfreq
      the absolute frequency threshold (default value: 1)
  -s minsup | --minsup minsup
      the relative frequency threshold (default value: 0.25)
  -i item1=number1,item2=number2,... | --items item1=number1,item2=number2,...
      item constraint specifying which items must or must not appear in a pattern; give items with their numbers of occurrences, separated by ',' (number 0 forbids the item)
  -e <6 or 10 or 14> | --RE <6 or 10 or 14>
      regular expression, specified by a number equal to 6, 10 or 14
  -a <0 or 1 or 2> | --algo <0 or 1 or 2>
      the algorithm to use: PPIC=0 (default), PPDC=1, or PPmixed=2
  -o <file> | --out <file>
      output the solutions
  -v | --verbose
      output all results in full detail
  -pp | --prefix-pattern
      output prefix-closed sequential patterns
  -y <0 or 1 or 2 or 3> | --fix <0 or 1 or 2 or 3>
      the fix to use: PPIC with all improvements=0 (default), PPIC+fix(4+1+3)=1, PPIC+fix(4+1)=2, or PPIC+fix(4)=3
  --help
      show how to use this application
Note : RE for data200k.txt
  RE6 = A*B(B|C)D*E
  RE10 = A*B(B|C)D*EF*(G|H)I*
  RE14 = A*(Q|BS*(B|C))D*E(I|S)*(F|H)G*R
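To get a feel for what the three expressions in the note accept, they can be tried out with Python's standard `re` module. This is only a sketch under two assumptions that the text above does not confirm: that each item is a single letter, and that a pattern satisfies the constraint when the concatenation of its items matches the whole expression. The helper name `accepted` is hypothetical and is not part of PPIC.

```python
import re

# The three expressions quoted in the note for data200k.txt.
RE6 = re.compile(r"A*B(B|C)D*E")
RE10 = re.compile(r"A*B(B|C)D*EF*(G|H)I*")
RE14 = re.compile(r"A*(Q|BS*(B|C))D*E(I|S)*(F|H)G*R")


def accepted(pattern_items, regex):
    """Assumption: the pattern is accepted when the string formed by
    its items (one letter per item) matches the whole expression."""
    return regex.fullmatch("".join(pattern_items)) is not None


print(accepted(list("AABCDDE"), RE6))  # True
print(accepted(list("BE"), RE6))       # False: (B|C) is missing
print(accepted(list("BCEG"), RE10))    # True
print(accepted(list("QEFR"), RE14))    # True
```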
Examples
The input for these examples is the dataset test.txt. Each line is a sequence, and the items in a sequence are separated by spaces.
test.txt =
1 2 3 2 3
2 1 2 3
1 2
2 3 4
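The input format and the notion of support can be illustrated with a few lines of Python. This is a minimal sketch, assuming test.txt holds the four sequences 1 2 3 2 3, 2 1 2 3, 1 2 and 2 3 4, one per line; the function names `parse_sdb`, `contains` and `support` are illustrative and do not come from PPIC.

```python
def parse_sdb(text):
    """Parse one sequence per line, items separated by whitespace."""
    return [line.split() for line in text.splitlines() if line.strip()]


def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so the order is respected.
    return all(item in it for item in pattern)


def support(sdb, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in sdb)


# Assumed content of test.txt (one sequence per line).
TEST_TXT = """1 2 3 2 3
2 1 2 3
1 2
2 3 4
"""

sdb = parse_sdb(TEST_TXT)
print(support(sdb, ["1", "2", "3"]))  # 2
print(support(sdb, ["2"]))            # 4
```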
Extracting sequential patterns
Given a minimum support threshold (minsup = 2, or 50%), find all patterns with a minimum length of 1 (Lmin=1).
$java -jar oscar.ppic.1.0.0.jar test.txt 1 0 -f 2
or $java -jar oscar.ppic.1.0.0.jar test.txt 1 0 -s 0.5
or $java -jar oscar.ppic.1.0.0.jar test.txt 1 0 --minsup 0.5
[output]
< 1 > : 3
< 1 2 > : 3
< 1 2 3 > : 2
< 1 3 > : 2
< 2 > : 4
< 2 2 > : 2
< 2 2 3 > : 2
< 2 3 > : 3
< 3 > : 3
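The result of this example can be cross-checked with a naive pattern-growth enumeration. The sketch below is not PPIC's constraint programming algorithm. It simply extends each frequent prefix by one item and recounts the support. It assumes that test.txt holds the four sequences 1 2 3 2 3, 2 1 2 3, 1 2 and 2 3 4, and that a relative threshold is converted to an absolute one by rounding up, which the text does not state.

```python
import math


def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def mine(sdb, minsup, prefix=()):
    """Yield (pattern, support) for each frequent pattern extending prefix."""
    items = sorted({item for seq in sdb for item in seq})
    for item in items:
        candidate = prefix + (item,)
        sup = sum(contains(seq, candidate) for seq in sdb)
        if sup >= minsup:
            yield candidate, sup
            # Only frequent patterns are extended further.
            yield from mine(sdb, minsup, candidate)


# Assumed content of test.txt.
sdb = [[1, 2, 3, 2, 3], [2, 1, 2, 3], [1, 2], [2, 3, 4]]

# Assumption: a relative threshold of 0.5 is rounded up to a count.
minsup = math.ceil(0.5 * len(sdb))  # 2

for pattern, sup in mine(sdb, minsup):
    print("<", *pattern, "> :", sup)
```

On this small database the sketch prints the same nine patterns and supports as the output listed above.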
Extracting sequential patterns under size constraint
Given a minimum support threshold (minsup = 2, or 50%), find all patterns with a minimum length of 3 (Lmin=3) and a maximum length of 3 (Lmax=3).
$java -jar oscar.ppic.1.0.0.jar test.txt 3 3 -f 2
[output]
< 1 2 3 > : 2
< 2 2 3 > : 2
Extracting sequential patterns under size constraint and item constraint
Given a minimum support threshold (minsup = 2, or 50%), find all patterns with a minimum length of 2 (Lmin=2) and a maximum length of 3 (Lmax=3) that contain item 3 once.
$java -jar oscar.ppic.1.0.0.jar test.txt 2 3 -f 2 -i 3=1
[output]
< 1 2 3 > : 2
< 1 3 > : 2
< 2 2 3 > : 2
< 2 3 > : 3
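The size and item constraints can be pictured as a filter over the nine frequent patterns listed in the first example. The sketch below applies them after the fact, which is only an illustration. It does not describe how PPIC enforces constraints internally. It also assumes that `item=number` means the item occurs exactly that many times. The function name `satisfies` is hypothetical.

```python
from collections import Counter

# Frequent patterns and supports from the first example (minsup = 2).
FREQUENT = {
    (1,): 3, (1, 2): 3, (1, 2, 3): 2, (1, 3): 2, (2,): 4,
    (2, 2): 2, (2, 2, 3): 2, (2, 3): 3, (3,): 3,
}


def satisfies(pattern, lmin, lmax, item_counts=None):
    """Check the size bounds and, if given, the item occurrence counts.

    Assumption: each item must occur exactly the stated number of
    times, so a count of 0 forbids the item.
    """
    if not lmin <= len(pattern) <= lmax:
        return False
    counts = Counter(pattern)
    return all(counts[item] == n for item, n in (item_counts or {}).items())


# Size constraint only: Lmin=3, Lmax=3.
print([p for p in FREQUENT if satisfies(p, 3, 3)])
# [(1, 2, 3), (2, 2, 3)]

# Size and item constraints: Lmin=2, Lmax=3, item 3 once.
print([p for p in FREQUENT if satisfies(p, 2, 3, {3: 1})])
# [(1, 2, 3), (1, 3), (2, 2, 3), (2, 3)]
```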
Extracting prefix-closed sequential patterns
Given a minimum support threshold (minsup = 2, or 50%), find all prefix-closed patterns.
$java -jar oscar.ppic.1.0.0.jar test.txt 1 0 -f 2 -pp
[output]
< 1 2 3 > : 2
< 1 3 > : 2
< 2 2 3 > : 2
< 2 3 > : 3
< 3 > : 3