PPICt : Tool for Sequencial Pattern Mining under time constraints (gap/span) using CP
PPICt relies on PPIC CP-based method. It handles several other user-defined constraints such as: pattern length constraint (min and max), Item inclusion/exclusion and full Regular expression constraint.
PPICt is carried out in OscaR Solver.
Experiments are available Here
- > Download
- User version : oscar.ppicgap.1.0.0.jar, Binary pack
- Developper link : https://bitbucket.org/projetsJOHN/cp4dt
- Datasets : datasetTime.tar.gz, Protein.txt, data200k.txt
- > Relevant publications
- Mining Time-constrained Sequential Patterns with Constraint Programming, Aoga, John O. R., Guns Tias, and Schaus Pierre , Constraints, Volume 22, (2017). [PDF][Slides]
Usage
The datasets that can be used for PPICt is dataset.sdb.sp. Script to convert dataset to the right one is available in scripts.tar.gz. Some datasets are available in dataset.tar.gz. The option -t is required for all time constraints and datasets .
$java -jar oscar.spm.ppict.jar [options] <SDB File> <Lmin > < Lmax>
<SDB File> the input sequential database <Lmin > the minimum size of patterns (minimum value 1) < Lmax> the (max) size of patterns (value 0 representing the longest sequence in SDB)[options] -f minfreq | --minfreq minfreq the absolute frequency threshold (default value: 1) -s minsup | --minsup minsup the relative frequency threshold (default 0.25 = 50%) -i item1=number1,item2=number2,... | --items item1=number1,item2=number2,... items constraint to specify what items must be or not in pattern, give items with their numbers of occurence seperate by ',' (number 0 to forbid item) -e <6 or 10 or 14> | --RE <6 or 10 or 14> regular expression specified by a number equal to 6, 10 or 14 -r <regexpstring> | --regexp <regexpstring> regular expression specified by a string with letter (starting by A), the correspondance of letter must be defined with --regexp-letter option <regexp-letter> give correspondance between letter in regular expression and item seperated by ',' all the letters in the regular expression must have a correspondance -t | --time to precise the database is with timestamps and take gap and span constraint into account -m <minGap> | --mingap <minGap> to specify the minimum gap >= 0 (default value: 0 to disable the minimum gap constraint) -n <maxGap> | --maxgap <maxGap> to specify the maximum gap >= 0 (default value: 0 to disable the maximum gap constraint) -w <minspan> | --minspan <minspan> to specify the maximum span >= 0 (default value: 0 to disable the maximum span constraint) -y <maxspan> | --maxspan <maxspan> to specify the maximum span >= 0 (default value: 0 to disable the maximum span constraint) -o <file> | --out <file> output the solutions -v | --verbose output all result with every details Note : RE for data200k.txt RE6 = A*B(B|C)D*E RE10 = A*B(B|C)D*EF*(G|H)I* RE14 = A*(Q|BS*(B|C))D*E(I|S)*(F|H)G*R
Examples
The input data of these examples is testtime.txt which is dataset. Each line is a sequence id, the timestamp, the size and item separated by a space.
testtime.txt = <sid> <timestamps> <size_of_itemset> <itemset> 1 2 1 10 1 5 1 20 1 6 1 40 1 10 1 30 1 11 1 20 2 1 1 20 2 2 1 10 2 9 1 10 2 12 1 40 2 15 1 30 2 18 1 10 2 24 1 20 3 2 1 10 3 4 1 20 3 6 1 40 3 8 1 40 3 10 1 20 3 12 1 50 3 14 1 30 4 1 1 10 4 2 1 30 4 3 1 30 4 4 1 20
Extracting sequential patterns under gap
Given minimum threshold (minsup = 2 or 50%) and gap [3,7], find all patterns with minimum length (Lmin=1).
$java -jar oscar.spm.ppict.jar testtime.txt 1 0 -f 2 -t -m 3 -n 7[output]<sup> : <patern> 4 : < 10 > 3 : < 10 20 > 3 : < 10 40 > 3 : < 10 40 30 > 4 : < 20 > 2 : < 20 20 > 2 : < 20 30 > 3 : < 40 > 2 : < 40 20 > 3 : < 40 30 > 4 : < 30 >
Extracting sequential patterns under gap and span
Given minimum threshold (minsup = 2 or 50%), gap[3,7], span[0,3] find all patterns.
$java -jar oscar.spm.ppict.jar testtime.txt 1 0 -f 2 -t -m 3 -n 7 -y 3[output]<sup> : <patern> 4 : < 10 > 2 : < 10 20 > 4 : < 20 > 3 : < 40 > 4 : < 30 >
Extracting sequential patterns under gap and regular expression constraint
> Given minimum threshold (minsup = 2 or 50%), gap[3,7] and regular expression "<20>+<30>*" find all patterns with length greater than 2(Lmin=2)
$java -jar oscar.spm.ppict.jar testtime.txt 2 0 -f 2 -t -m 3 -n 7 -r "A+B*" A=20,B=30[output]<sup> : <patern> 2 : < 20 20 > 2 : < 20 30 >
Experiments
Our work is carried in OscaR Solver, implemented in Scala and are run under JVM with maximum memory set to 8GB. We used a machine with processor 2.7Hz Intel core i5 and 8GB of RAM with Linux 3.19.0-32-generic 64 bits distribution Mint 17.3. Execution time limit is set to 3600 seconds (1 hour).
Our proposals are compared with:
- GapSeq: Kemmar, A., Loudni, S., Lebbah, Y., Boizumault, P., Charnois, T.: A global constraint for mining sequential patterns with gap constraint. arXiv preprint arXiv:1511.08350 (2015)
- PPIC: Aoga J.O.R, Guns T., and Schaus P. An efficient algorithm for mining frequent sequence with constraint programming. ECML-PKDD 2016, 2016. [PDF]
- cSPADE: Zaki, M.J.: Sequence mining in categorical domains: incorporating constraints. In: Proceedings of the ninth international conference on Information and knowledge management, ACM (2000) 422–429