Introduction
The Model Checking Contest is a yearly scientific event dedicated to the assessment of formal verification tools for concurrent systems.
The Model Checking Contest has two different parts: the Call for Models, which gathers Petri net models proposed by the scientific community, and the Call for Tools, which benchmarks verification tools developed within the scientific community.
I. Rules specific to Model Submission and Publication
M-1. The Organizers of the Call for Models of the 2016 edition of the Model Checking Contest are: Hubert Garavel, Lom Hillah, and Fabrice Kordon.
M-2. The detailed instructions for proposing a model are given in the Call for Models and the model submission toolkit.
M-3. Any model submitted should be in the public domain. Any model that is proprietary, copyrighted, and/or confidential should not be submitted. Persons who submit a model hereby allow the organizers to freely use this model and publish it on the web if this model is selected.
M-4. If a submitted model is improper (e.g., invalid PNML file, missing or incomplete model description form, etc.), the organizers may contact the person(s) who submitted the model and ask for corrections. If corrections are not done or are unsatisfactory, the organizers can reject the model.
M-5. Among all submitted models, the Organizers will select approximately a dozen, according to criteria such as: originality, complexity, diversity with respect to other models, industrial relevance, etc.
M-6. The list of models selected for the 2016 edition of the Model Checking Contest will be kept confidential by the Organizers until all tools have been submitted, so that selected models are not known in advance by tool developers.
M-7. When the tool contest starts, the selected models will be published on the web site and the names of the authors of selected models will be mentioned on the same web site. The Organizers of the Call for Models will not submit tools themselves, and will ensure fair competition between all tool developers competing in the Call for Tools.
II. Rules specific to Tool Submission and Publication
T-1. The Organizers of the Call for Tools of the 2016 edition of the Model Checking Contest are: Francis Hulin-Hubard, and Fabrice Kordon.
T-2. Developers of a tool (or of a tool extension) may participate with their own work. A tool may also be submitted by a third party, provided the submission is the latest version of the tool available on the web.
T-3. All MCC models are provided in PNML, the standard interchange format for Petri nets. In principle, each submitted tool should be able to read these PNML files; if a tool cannot read PNML, it may read alternative files provided by the submitter(s) in formats other than PNML. This only holds for «known» models, i.e., those already published when the Call for Tools is issued; for «surprise» models, the use of PNML is mandatory.
T-4. A tool may participate for certain models only and/or for certain examinations only (e.g., state space generation, some types of formulas). See the Submission Manual for details.
T-5. Submitted tools must honestly perform the computations required by the MCC examinations. Using precomputed results is not allowed and, if detected, will disqualify the tool submission.
T-6. Each examination must be computed independently, i.e., without reusing the results of an earlier computation. However, when a file contains several formulas, the tool may reuse the results of some formulas to compute other formulas.
T-7. Participants authorize the publication of the MCC results.
III. Rules specific to the Evaluation and Scoring of Tools
This section defines the way we intend to operate tools, evaluate tools, and compute scores.
III.1. Description of the Examinations for 2016
This section defines the examinations for 2016, as well as the outputs expected by tools for these examinations.
E-1.1. There are five categories of examinations: StateSpace, UpperBounds, Reachability formulas, CTL formulas, and LTL formulas.
E-1.2. For the StateSpace category, each participating tool must provide the following results: the number of states in the marking graph, the number of transition firings in the marking graph, the maximum number of tokens per marking in the net, and the maximum number of tokens that can be found in a single place.
E-1.3. For the UpperBounds category, each participating tool must provide, for each place designated in a formula, an integer value specifying the upper bound of that place. The provided value is expected to be the exact upper bound of the corresponding place in the system.
For each examination of the UpperBounds category, 16 formulas (selected places) are proposed.
E-1.4. The Reachability formulas category contains three subcategories: ReachabilityDeadlock, ReachabilityFireability, and ReachabilityCardinality. Each tool must provide (when it answers):
- for ReachabilityDeadlock, a boolean TRUE or FALSE indicating the existence (or not) of at least one deadlock state in the system,
- for ReachabilityFireability and ReachabilityCardinality, a sequence of booleans (TRUE or FALSE), each stating whether the corresponding formula is satisfied.
The difference between the three subcategories lies only in the atomic propositions of the formulas. ReachabilityDeadlock corresponds to a single formula stating the existence of a deadlock state. In ReachabilityFireability, atomic propositions are boolean combinations of propositions checking the fireability of transitions. In ReachabilityCardinality, atomic propositions are boolean combinations of propositions comparing the numbers of tokens in places.
For each examination of the ReachabilityFireability and ReachabilityCardinality categories, 16 formulas are proposed.
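Purely as an illustration (these examples use an informal notation, not the official MCC property syntax, which is fixed by the XML representation mentioned in rule E-1.8, and the transition and place names are invented), a ReachabilityFireability property could have the shape

    EF ( is-fireable(t1) AND is-fireable(t2) )

whereas a ReachabilityCardinality property could have the shape

    EF ( tokens(p1) >= 2 AND tokens(p2) <= tokens(p3) )

where t1 and t2 are transitions and p1, p2, p3 are places of the net.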
E-1.5. The CTL formulas category contains two subcategories: CTLFireability and CTLCardinality. Each tool must provide (when it answers) a sequence of booleans (TRUE or FALSE), each stating whether the corresponding formula is satisfied.
The difference between the two subcategories lies only in the atomic propositions of the formulas. In CTLFireability, atomic propositions are boolean combinations of propositions checking the fireability of transitions. In CTLCardinality, atomic propositions are boolean combinations of propositions comparing the numbers of tokens in places.
For each examination of the CTL category, 16 formulas are proposed.
E-1.6. The LTL formulas category contains two subcategories: LTLFireability and LTLCardinality. Each tool must provide (when it answers) a sequence of booleans (TRUE or FALSE), each stating whether the corresponding formula is satisfied.
The difference between the two subcategories lies only in the atomic propositions of the formulas. In LTLFireability, atomic propositions are boolean combinations of propositions checking the fireability of transitions. In LTLCardinality, atomic propositions are boolean combinations of propositions comparing the numbers of tokens in places.
For each examination of the LTL category, 16 formulas are proposed.
E-1.7. If a tool participates in an examination of a category, its answers must be returned using a dedicated keyword. If the tool does not participate, it must state so explicitly (see the submission manual). If the tool cannot compute a result, it must also state so explicitly, with a dedicated keyword. When the analysis script does not find any appropriately formatted answer, it assumes that the tool could not compute anything.
All instructions with regard to outputs are presented in the submission manual and must be respected. The MCC organizers are not responsible for mistakes in the automatic evaluation of results that are caused by outputs not following these instructions.
E-1.8. For all examinations involving formulas, the reference is the XML representation of these formulas; the textual format is provided only to ease readability.
III.2. Evaluation of Tool Confidence in 2016
This section defines the notion of tool Confidence, to be computed during a preliminary pass, once all the tools have been processed for all the examinations (and have therefore provided all the expected results).
E-2.1. The Confidence of a tool is given by a value Rtool ∈ [0, 1], where 1 means that the tool always finds the commonly agreed result within a set of consistent values, and 0 means that it never does.
E-2.2. The evaluation of the Confidence of the participating tools is performed during a preliminary pass that examines all the values provided by the tools. A value is a piece of result provided by a tool for a given examination. For example, if an examination requires N values, each of these is considered separately when evaluating tool Confidence. Typically, the StateSpace examination requires tools to compute 4 values, which are considered separately when evaluating tool Confidence.
From the full list of values provided by tools for the Model Checking Contest, let us consider V, the subset of consistent values that are defined in the following way:
- A majority of tools agree on the same value,
- and at least three tools agree on this value.
Let us now consider Vtool, the set of values provided by a given tool that agree with those of the consistent set V (values not computed by the tool are of course not considered). The Confidence of a tool is then computed as follows:
Rtool = |Vtool| / |V|
where |Set| denotes the cardinality of a set.
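As an illustration of rules E-2.1 and E-2.2, the Python sketch below shows one possible way to derive V and Rtool from the values reported by the tools. It is not the official MCC analysis script: the data layout and the helper names (consistent_values, confidence) are assumptions, and «a majority of tools» is read here as a strict majority of the tools that answered for a given value.

    from collections import Counter

    def consistent_values(results):
        """results: {value_id: {tool: reported_value}}, keeping only values
        actually computed.  Returns V as {value_id: agreed_value}, where a
        strict majority of the answering tools, and at least three of them,
        agree on the same value."""
        V = {}
        for value_id, answers in results.items():
            if not answers:
                continue
            value, count = Counter(answers.values()).most_common(1)[0]
            if count >= 3 and count > len(answers) / 2:
                V[value_id] = value
        return V

    def confidence(tool, results, V):
        """Rtool = |Vtool| / |V|, where Vtool is the subset of V on which
        the tool's reported value matches the agreed one."""
        if not V:
            return 0.0
        agreeing = sum(1 for value_id, agreed in V.items()
                       if results[value_id].get(tool) == agreed)
        return agreeing / len(V)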
III.3. Computing the Reference Values in 2016
This section deals with how we compute the reference values that are used to evaluate the results provided by tools. The objective is to determine the correct result that is a priori unknown in most cases. We must consider two cases: when the expected result is a value and when the expected result is an interval.
E-3.1. For a given examination where exact values are expected, all the values provided by the participating tools (i.e., those that answered something other than «cannot compute» or «do not compete») are examined to determine the reference value, according to the majority of the participating tools. In that process, each tool is weighted by its Confidence rate Rtool. Several situations are considered (an illustrative sketch follows the list below):
- All tools agree. Then, the reference value is this globally agreed value.
- A weighted majority of tools agree on one value. Then, this value is considered to be the reference one.
- Only two tools (t1 and t2) provide different values. If Rt1 > Rt2, the value provided by t1 is considered. If Rt2 > Rt1, the value provided by t2 is considered. Otherwise, the reference value is declared to be unknown.
- Only one tool provides a value. Then, if Rtool > 0.93, the value is considered to be the reference value. Otherwise, the reference value is declared to be unknown.
- No tool provides any value. Then, the reference value is declared to be unknown.
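The following Python sketch illustrates one possible reading of rule E-3.1. It is not the official MCC evaluation code: the dictionary layout is an assumption, and «weighted majority» is interpreted here as more than half of the total Confidence weight.

    from collections import defaultdict

    def reference_value(answers, confidence):
        """answers: {tool: value} for the tools that actually answered.
        confidence: {tool: Rtool}.  Returns the reference value, or None
        when it is declared unknown."""
        if not answers:
            return None                        # no tool provides any value
        if len(answers) == 1:                  # a single tool answered
            (tool, value), = answers.items()
            return value if confidence[tool] > 0.93 else None
        values = set(answers.values())
        if len(values) == 1:
            return next(iter(values))          # global agreement
        if len(answers) == 2:                  # two tools, different values
            (t1, v1), (t2, v2) = answers.items()
            if confidence[t1] > confidence[t2]:
                return v1
            if confidence[t2] > confidence[t1]:
                return v2
            return None                        # equal Confidence: unknown
        weight = defaultdict(float)            # Confidence-weighted vote per value
        for tool, value in answers.items():
            weight[value] += confidence[tool]
        best = max(weight, key=weight.get)
        if weight[best] > sum(weight.values()) / 2:
            return best                        # weighted majority
        return None                            # no weighted majority: unknown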
E-3.2. This rule is given for information purposes only in 2016 because, after a poll of the pre-registered tool developers concerning the UpperBounds examination, all tools turned out to be able to compute an exact upper bound for places. So far, the manipulation of intervals therefore appears to be unnecessary, but it might be activated in future editions of the Model Checking Contest.
For a given examination where an interval is expected as a result, a more complex algorithm must be defined. First, let us define that:
- N is the set of natural numbers {0, 1, 2, ...},
- for a given expected value, we get, from T tools, a list of answers noted [ai, bi], where i varies from 1 to T.
We can then define a function f: N → [0..T] as follows: for any x ∈ N, f(x) is the number of tools that returned an interval containing x. Formally, f(x) = |{i ∈ {1..T} : x ∈ [ai, bi]}|.
Let us observe that the intersection of all intervals is non-empty if and only if there exists some x ∈ N such that f(x) = T. We now consider y ∈ [1, T], the maximum value of the function f, i.e., y = max({f(x) : x ∈ N}). We can then define the reference set S as follows:
S = {x ∈ N: f(x) = y}
The reference set S is the set of naturals on which the maximum number of tools agree to locate the computed value. It is the best guess we use to evaluate the accuracy of results provided by the participating tools.
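The reference set S can be computed, for instance, as in the following Python sketch (illustrative only; the function name and the input representation are assumptions):

    def reference_set(intervals):
        """intervals: list of (a, b) pairs with a <= b, one per answering tool.
        Returns S, the set of naturals on which the largest number of tools
        agree to locate the computed value."""
        if not intervals:
            return set()
        # Beyond the largest upper bound, f(x) = 0, so scanning the naturals
        # up to that bound is sufficient.
        limit = max(b for _, b in intervals)
        f = {x: sum(1 for a, b in intervals if a <= x <= b)
             for x in range(limit + 1)}
        y = max(f.values())                    # maximum value of f
        return {x for x, fx in f.items() if fx == y}

For example, with the answers [3, 5], [4, 8], and [4, 6], f reaches its maximum y = 3 on the naturals 4 and 5, so S = {4, 5}.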
III.4. Scoring for 2016
This section provides information on how we compute a score from a value provided by a tool. Please note that some details may change in the final evaluation (due to implementation matters, for example).
E-4.1. Let us define the following constants that are used to compute scores in the model checking contest:
- ValueScore: this is the number of points a tool gets when it correctly computes a value in an examination. The corresponding number depends on the examination, the total number of points a tool can receive for an examination being 16 (e.g., when 16 formulas are provided, each correctly computed formula is rewarded with 1 point).
- ValuePenalty: this is the number of points a tool loses when a value is considered to be wrong. It is worth twice the points awarded when a tool provides the correct value.
- ExaminationBonusCPU: this is the number of extra points a tool receives for being the fastest to compute an examination (the result must not contain an erroneous value). The number associated with this constant is 4 (ValueScore ÷ 4).
- ExaminationBonusMemory: this is the number of extra points a tool gets for having the lowest memory footprint when computing an examination (the result must not contain an erroneous value). The number associated with this constant is 4 (ValueScore ÷ 4).
E-4.2. When a tool provides a value corresponding to the reference value for an examination, it gets ValueScore points. The tool gets 0 points for values it did not compute. The score for the examination is the sum of the scores obtained for its values, unless there is a mistake, as stated in rule E-4.3.
E-4.3. To get a score for an examination, a tool must provide at least one correct value. If no value is provided (the tool states «cannot compute» or «do not compete»), the score for the corresponding examination is 0. For each wrong value provided, the tool receives a penalty of ValuePenalty points. To get bonus points for the examination (CPU and/or memory), a tool must:
- compute at least half of the values required for the examination, and
- provide no erroneous value.
E-4.4. Scores for «known» models are multiplied by 1, scores for «stripped» models by 3, and scores for «surprise» models by 5.
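As an illustration of rules E-4.1 to E-4.4, the Python sketch below computes the score of one examination. It is not the official MCC evaluation code: the function and parameter names are invented, an examination with no correct answer is scored 0 here according to one possible reading of E-4.3, and the model-type multiplier is assumed to apply to the whole examination score, bonuses included.

    def examination_score(outcomes, value_score, fastest_cpu=False,
                          lowest_memory=False, model_multiplier=1):
        """outcomes: one entry per expected value: 'correct', 'wrong' or
        'missing'.  value_score: points per correct value (the 16 points of
        an examination divided by the number of expected values).
        model_multiplier: 1 for «known», 3 for «stripped», 5 for «surprise»."""
        correct = outcomes.count('correct')
        wrong = outcomes.count('wrong')
        if correct == 0:
            return 0                         # at least one correct value is required (E-4.3)
        value_penalty = 2 * value_score      # a wrong value costs twice a correct one
        score = correct * value_score - wrong * value_penalty
        # Bonuses require at least half of the values and no erroneous value.
        if wrong == 0 and correct >= len(outcomes) / 2:
            if fastest_cpu:
                score += 4                   # ExaminationBonusCPU
            if lowest_memory:
                score += 4                   # ExaminationBonusMemory
        return score * model_multiplier

For instance, under these assumptions, a tool answering 12 of 16 formulas correctly (1 point each), with no wrong answer and the best CPU time, would obtain (12 + 4) × 5 = 80 points on a «surprise» model.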
E-4.5. For each category, a score is computed as the sum of all the examination scores in this category. Tools are then ranked in this category with respect to this sum. If a tool is submitted in several variants, only the best variant for the current category is considered for the podium (the scores of the other variants are displayed for information).
III.5. Execution Conditions for Tools in 2016
This section defines the execution conditions for all tools.
E-5.1. For a given examination, each tool is executed in a virtual machine. This machine is booted from a reference disk image, the computation is performed, the results are retrieved, and the machine is halted, so that no information can be carried over from one execution to another.
For a given model, all examinations on all instances are processed for all tools on the same physical machine, so that results can be compared fairly in terms of performance (CPU, memory).
For each examination, there is a CPU time limit of one hour. Once this time limit is reached, the virtual machine is halted and the values not provided are considered as «cannot compute».
E-5.2. For sequential tools, the virtual machine characteristics are defined as follows:
- 16 GB of memory,
- 1 core.
E-5.3. For concurrent tools, the virtual machine characteristics are defined as follows:
- 16 GB of memory,
- 4 cores.
IV. Miscellaneous Provisions
P-1. In case of technical problems or litigious situations, participants should first discuss the matter with the organizers of the Call for Models or of the Call for Tools, respectively.
P-2. Any issue that cannot be solved by discussion with the organizers (call for models, call for tools) should be brought before Didier Buchs, who is one of the two General Chairs.
P-3. The organizers may withdraw a participating tool at any moment if the present rules are not respected.
P-4. A participating tool may withdraw at any moment; it will then be completely removed from the records of the contest.
P-5. The Model Checking Contest is governed by French laws. Only the justice courts of Paris (France) are legally competent for settling disputes related to this contest.