PPICt : Tool for Sequencial Pattern Mining under time constraints (gap/span) using CP

PPICt relies on PPIC CP-based method. It handles several other user-defined constraints such as: pattern length constraint (min and max), Item inclusion/exclusion and full Regular expression constraint.

PPICt is carried out in OscaR Solver.

Experiments are available Here

Usage

The datasets that can be used for PPICt is dataset.sdb.sp. Script to convert dataset to the right one is available in scripts.tar.gz. Some datasets are available in dataset.tar.gz. The option -t is required for all time constraints and datasets .

  $java -jar oscar.spm.ppict.jar [options] <SDB File> <Lmin > < Lmax>
<SDB File> the input sequential database <Lmin > the minimum size of patterns (minimum value 1) < Lmax> the (max) size of patterns (value 0 representing the longest sequence in SDB) [options] -f minfreq | --minfreq minfreq the absolute frequency threshold (default value: 1) -s minsup | --minsup minsup the relative frequency threshold (default 0.25 = 50%) -i item1=number1,item2=number2,... | --items item1=number1,item2=number2,... items constraint to specify what items must be or not in pattern, give items with their numbers of occurence seperate by ',' (number 0 to forbid item) -e <6 or 10 or 14> | --RE <6 or 10 or 14> regular expression specified by a number equal to 6, 10 or 14 -r <regexpstring> | --regexp <regexpstring> regular expression specified by a string with letter (starting by A), the correspondance of letter must be defined with --regexp-letter option <regexp-letter> give correspondance between letter in regular expression and item seperated by ',' all the letters in the regular expression must have a correspondance -t | --time to precise the database is with timestamps and take gap and span constraint into account -m <minGap> | --mingap <minGap> to specify the minimum gap >= 0 (default value: 0 to disable the minimum gap constraint) -n <maxGap> | --maxgap <maxGap> to specify the maximum gap >= 0 (default value: 0 to disable the maximum gap constraint) -w <minspan> | --minspan <minspan> to specify the maximum span >= 0 (default value: 0 to disable the maximum span constraint) -y <maxspan> | --maxspan <maxspan> to specify the maximum span >= 0 (default value: 0 to disable the maximum span constraint) -o <file> | --out <file> output the solutions -v | --verbose output all result with every details Note : RE for data200k.txt RE6 = A*B(B|C)D*E RE10 = A*B(B|C)D*EF*(G|H)I* RE14 = A*(Q|BS*(B|C))D*E(I|S)*(F|H)G*R

Examples

The input data of these examples is testtime.txt which is dataset. Each line is a sequence id, the timestamp, the size and item separated by a space.

 testtime.txt = <sid> <timestamps> <size_of_itemset> <itemset>
1 2 1 10
1 5 1 20
1 6 1 40
1 10 1 30
1 11 1 20
2 1 1 20
2 2 1 10
2 9 1 10
2 12 1 40
2 15 1 30
2 18 1 10
2 24 1 20
3 2 1 10
3 4 1 20
3 6 1 40
3 8 1 40
3 10 1 20
3 12 1 50
3 14 1 30
4 1 1 10
4 2 1 30
4 3 1 30
4 4 1 20
                

Extracting sequential patterns under gap

Given minimum threshold (minsup = 2 or 50%) and gap [3,7], find all patterns with minimum length (Lmin=1).

    $java -jar oscar.spm.ppict.jar testtime.txt 1 0 -f 2 -t -m 3 -n 7 

[output]<sup> : <patern> 4 : < 10 > 3 : < 10 20 > 3 : < 10 40 > 3 : < 10 40 30 > 4 : < 20 > 2 : < 20 20 > 2 : < 20 30 > 3 : < 40 > 2 : < 40 20 > 3 : < 40 30 > 4 : < 30 >

Extracting sequential patterns under gap and span

Given minimum threshold (minsup = 2 or 50%), gap[3,7], span[0,3] find all patterns.

    $java -jar oscar.spm.ppict.jar testtime.txt 1 0 -f 2 -t -m 3 -n 7 -y 3
[output]<sup> : <patern> 4 : < 10 > 2 : < 10 20 > 4 : < 20 > 3 : < 40 > 4 : < 30 >

Extracting sequential patterns under gap and regular expression constraint

> Given minimum threshold (minsup = 2 or 50%), gap[3,7] and regular expression "<20>+<30>*" find all patterns with length greater than 2(Lmin=2)

    $java -jar oscar.spm.ppict.jar testtime.txt 2 0 -f 2 -t -m 3 -n 7 -r "A+B*" A=20,B=30
[output]<sup> : <patern> 2 : < 20 20 > 2 : < 20 30 >

Experiments

Our work is carried in OscaR Solver, implemented in Scala and are run under JVM with maximum memory set to 8GB. We used a machine with processor 2.7Hz Intel core i5 and 8GB of RAM with Linux 3.19.0-32-generic 64 bits distribution Mint 17.3. Execution time limit is set to 3600 seconds (1 hour).

Our proposals are compared with:

  1. CP-Based Approaches : GapSeq and PPIC
  2. Specialized Approaches : cSPADE
  • GapSeq: Kemmar, A., Loudni, S., Lebbah, Y., Boizumault, P., Charnois, T.: A global constraint for mining sequential patterns with gap constraint. arXiv preprint arXiv:1511.08350 (2015)
  • PPIC: Aoga J.O.R, Guns T., and Schaus P. An efficient algorithm for mining frequent sequence with constraint programming. ECML-PKDD 2016, 2016. [PDF]
  • cSPADE: Zaki, M.J.: Sequence mining in categorical domains: incorporating constraints. In: Proceedings of the ninth international conference on Information and knowledge management, ACM (2000) 422–429

CPU times when considering minimum and maximum gap constraints for several minsup (missing points indicate a timeout)

cp

CPU times when considering both gap and span constraints for several minsup (missing points indicate a timeout)

cp

PU times for several maximum gap with fixed minsup over Bible, Fifa, Leviathan and PubMed datasets

cp

Comparing PPICt without time restriction (PPICt[0,Inf]) with PPIC

cp