phase 2 complete - skills ported, with smoke tests

Willem van den Ende 2026-04-22 14:15:58 +01:00
parent 2c06838202
commit 701c7594fd
14 changed files with 1684 additions and 8 deletions

876  .pi/skills/distill/SKILL.md  Normal file

@@ -0,0 +1,876 @@
---
name: distill
description: "Extract an Allium specification from an existing codebase. Use when the user has existing code and wants to distil behaviour into a spec, reverse engineer a specification from implementation, generate a spec from code, turn implementation into a behavioural specification, or document what a codebase does in Allium terms."
disable-model-invocation: true
license: MIT
metadata:
  upstream: https://github.com/juxt/allium
  version: 3
---
# Distillation guide
This guide covers extracting Allium specifications from existing codebases. The core challenge is the same as forward elicitation: finding the right level of abstraction. In elicitation you filter out implementation ideas as they arise. In distillation you filter out implementation details that already exist. Both require the same judgement about what matters at the domain level.
Code tells you *how* something works. A specification captures *what* it does and *why* it matters. The skill is asking "why does the stakeholder care about this?" and "could this be different while still being the same system?"
## Scoping the distillation effort
Before diving into code, establish what you are trying to specify. Not every line of code deserves a place in the spec.
### Questions to ask first
1. **"What subset of this codebase are we specifying?"**
Mono repos often contain multiple distinct systems. You may only need a spec for one service or domain. Clarify boundaries explicitly before starting.
2. **"Is there code we should deliberately exclude?"**
- **Legacy code**: features kept for backwards compatibility but not part of the core system
- **Incidental code**: supporting infrastructure that is not domain-level (logging, metrics, deployment)
- **Deprecated paths**: code scheduled for removal
- **Experimental features**: behind feature flags, not yet design decisions
3. **"Who owns this spec?"**
Different teams may own different parts of a mono repo. Each team's spec should focus on their domain.
### The "Would we rebuild this?" test
For any code path you encounter, ask: "If we rebuilt this system from scratch, would this be in the requirements?"
- Yes: include in spec
- No, it is legacy: exclude
- No, it is infrastructure: exclude
- No, it is a workaround: exclude (but note the underlying need it addresses)
### Documenting scope decisions
At the top of a distilled spec, document what is included and excluded:
```
-- allium: 3
-- interview-scheduling.allium
-- Scope: Interview scheduling flow only
-- Includes: Candidacy, Interview, InterviewSlot, Invitation, Feedback
-- Excludes:
-- - User authentication (use auth library spec)
-- - Analytics/reporting (separate spec)
-- - Legacy V1 API (deprecated, not specified)
-- - Greenhouse sync (use greenhouse library spec)
```
The version marker (`-- allium: N`) must be the first line of every `.allium` file. Use the current language version number.
## Finding the right level of abstraction
Distillation and elicitation share the same fundamental challenge: choosing what to include. The tests below work in both directions, whether you are hearing a stakeholder describe a feature or reading code that implements it.
### The "Why" test
For every detail in the code, ask: "Why does the stakeholder care about this?"
| Code detail | Why? | Include? |
|-------------|------|----------|
| Invitation expires in 7 days | Affects candidate experience | Yes |
| Token is 32 bytes URL-safe | Security implementation | No |
| Sessions stored in Redis | Performance choice | No |
| Uses PostgreSQL JSONB | Database implementation | No |
| Slot status changes to 'proposed' | Affects what candidate sees | Yes |
| Email sent when invitation accepted | Communication requirement | Yes |
If you cannot articulate why a stakeholder would care, it is probably implementation.
### The "Could it be different?" test
Ask: "Could this be implemented differently while still being the same system?"
- If yes: probably implementation detail, abstract it away
- If no: probably domain-level, include it
| Detail | Could be different? | Include? |
|--------|---------------------|----------|
| `secrets.token_urlsafe(32)` | Yes, any secure token generation | No |
| 7-day invitation expiry | No, this is the design decision | Yes |
| PostgreSQL database | Yes, any database | No |
| "Pending, Confirmed, Completed" states | No, this is the workflow | Yes |
### The "Template vs Instance" test
Is this a **category** of thing, or a **specific instance**?
| Instance (often implementation) | Template (often domain-level) |
|--------------------------------|-------------------------------|
| Google OAuth | Authentication provider |
| Slack webhook | Notification channel |
| SendGrid API | Email delivery |
| `timedelta(hours=3)` | Confirmation deadline |
Sometimes the instance IS the domain concern. See "The concrete detail problem" below.
## The distillation mindset
### Code is over-specified
Every line of code makes decisions that might not matter at the domain level:
```python
# Code tells you:
def send_invitation(candidate_id: int, slot_ids: List[int]) -> Invitation:
    candidate = db.session.query(Candidate).get(candidate_id)
    slots = db.session.query(InterviewSlot).filter(
        InterviewSlot.id.in_(slot_ids),
        InterviewSlot.status == 'confirmed'
    ).all()
    invitation = Invitation(
        candidate_id=candidate_id,
        token=secrets.token_urlsafe(32),
        expires_at=datetime.utcnow() + timedelta(days=7),
        status='pending'
    )
    db.session.add(invitation)
    for slot in slots:
        slot.status = 'proposed'
        invitation.slots.append(slot)
    db.session.commit()
    send_email(
        to=candidate.email,
        template='interview_invitation',
        context={'invitation': invitation, 'slots': slots}
    )
    return invitation
```
```
-- Specification should say:
rule SendInvitation {
  when: SendInvitation(candidacy, slots)
  requires: slots.all(s => s.status = confirmed)
  ensures:
    for s in slots:
      s.status = proposed
  ensures: Invitation.created(
    candidacy: candidacy,
    slots: slots,
    expires_at: now + 7.days,
    status: pending
  )
  ensures: Email.created(
    to: candidacy.candidate.email,
    template: interview_invitation
  )
}
```
What we dropped:
- `candidate_id: int` became just `candidacy`
- `db.session.query(...)` became relationship traversal
- `secrets.token_urlsafe(32)` removed entirely (token is implementation)
- `datetime.utcnow() + timedelta(...)` became `now + 7.days`
- `db.session.add/commit` implied by `created`
- `invitation.slots.append(slot)` implied by relationship
### Ask "Would a product owner care?"
For every detail in the code, ask:
| Code detail | Product owner cares? | Include? |
|-------------|---------------------|----------|
| Invitation expires in 7 days | Yes, affects candidate experience | Yes |
| Token is 32 bytes URL-safe | No, security implementation | No |
| Uses SQLAlchemy ORM | No, persistence mechanism | No |
| Email template name | Maybe, if templates are design decisions | Maybe |
| Slot status changes to 'proposed' | Yes, affects what candidate sees | Yes |
| Database transaction commits | No, implementation detail | No |
### Distinguish means from ends
**Means:** how the code achieves something.
**Ends:** what outcome the system needs.
| Means (code) | Ends (spec) |
|--------------|-------------|
| `requests.post('https://slack.com/api/...')` | `Notification.created(channel: slack)` |
| `candidate.oauth_token = google.exchange(code)` | `Candidate authenticated` |
| `redis.setex(f'session:{id}', 86400, data)` | `Session.created(expires: 24.hours)` |
| `for slot in slots: slot.status = 'cancelled'` | `for s in slots: s.status = cancelled` |
## The concrete detail problem
The hardest judgement call: when is a concrete detail part of the domain vs just implementation?
### Google OAuth example
You find this code:
```python
OAUTH_PROVIDERS = {
    'google': GoogleOAuthProvider(client_id=..., client_secret=...),
}

def authenticate(provider: str, code: str) -> User:
    return OAUTH_PROVIDERS[provider].authenticate(code)
```
**Question:** Is "Google OAuth" domain-level or implementation?
**It is implementation if:**
- Google is just the auth mechanism chosen
- It could be replaced with any OAuth provider
- Users do not see or care which provider
- The code is written generically (provider is a parameter)
**It is domain-level if:**
- Users explicitly choose Google (vs Microsoft, etc.)
- "Sign in with Google" is a feature
- Google-specific scopes or permissions are used
- Multiple providers are supported as a feature
**How to tell:** Look at the UI and user flows. If users see "Sign in with Google" as a choice, it is domain-level. If they just see "Sign in" and Google happens to be behind it, it is implementation.
### Database choice example
You find PostgreSQL-specific code:
```python
from sqlalchemy.dialects.postgresql import JSONB, ARRAY

class Candidate(Base):
    skills = Column(ARRAY(String))
    metadata = Column(JSONB)
```
**Almost always implementation.** The spec should say:
```
entity Candidate {
  skills: Set<String>
  metadata: String? -- or model specific fields
}
```
The specific database is rarely domain-level. Exception: if the system explicitly promises PostgreSQL compatibility or specific PostgreSQL features to users.
### Third-party integration example
You find Greenhouse ATS integration:
```python
class GreenhouseSync:
    def import_candidate(self, greenhouse_id: str) -> Candidate:
        data = self.client.get_candidate(greenhouse_id)
        return Candidate(
            name=data['name'],
            email=data['email'],
            greenhouse_id=greenhouse_id,
            source='greenhouse'
        )
```
**Could be either:**
**Implementation if:**
- Greenhouse is just where candidates happen to come from
- Could be swapped for Lever, Workable, etc.
- The integration is an implementation detail of "candidates are imported"
Spec:
```
external entity Candidate {
  name: String
  email: String
  source: CandidateSource
}
```
**Product-level if:**
- "Greenhouse integration" is a selling point
- Users configure their Greenhouse connection
- Greenhouse-specific features are exposed (like syncing feedback back)
Spec:
```
external entity Candidate {
  name: String
  email: String
  greenhouse_id: String? -- explicitly modelled
}

rule SyncFromGreenhouse {
  when: GreenhouseWebhookReceived(candidate_data)
  ensures: Candidate.created(
    ...
    greenhouse_id: candidate_data.id
  )
}
```
### The "Multiple implementations" heuristic
Look for variation in the codebase:
- If there is only one OAuth provider, probably implementation
- If there are multiple OAuth providers, probably domain-level
- If there is only one notification channel, probably implementation
- If there are Slack AND email AND SMS, probably domain-level
The presence of multiple implementations suggests the variation itself is a domain concern.
## Distillation process
### Step 1: Map the territory
Before extracting any specification, understand the codebase structure:
1. **Identify entry points.** API routes, CLI commands, message handlers, scheduled jobs.
2. **Find the domain models.** Usually in `models/`, `entities/`, `domain/`.
3. **Locate business logic.** Services, use cases, handlers.
4. **Note external integrations.** What third parties does it talk to?
Create a rough map:
```
Entry points:
  - API: /api/candidates/*, /api/interviews/*, /api/invitations/*
  - Webhooks: /webhooks/greenhouse, /webhooks/calendar
  - Jobs: send_reminders, expire_invitations, sync_calendars
Models:
  - Candidate, Interview, InterviewSlot, Invitation, Feedback
Services:
  - SchedulingService, NotificationService, CalendarService
Integrations:
  - Google Calendar, Slack, Greenhouse, SendGrid
```
### Step 2: Extract entity states
Look at enum fields and status columns:
```python
class Invitation(Base):
    status = Column(Enum('pending', 'accepted', 'declined', 'expired'))
```
Becomes:
```
entity Invitation {
  status: pending | accepted | declined | expired
}
```
Look for enum definitions, status or state columns, constants like `STATUS_PENDING = 'pending'`, and state machine libraries (e.g. `transitions`, `django-fsm`).
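For instance, both of the following patterns encode the same state set as the enum column above. This is a sketch with illustrative names; the second variant assumes the `transitions` library:
```python
# Pattern: status constants scattered through a module
STATUS_PENDING = 'pending'
STATUS_ACCEPTED = 'accepted'
STATUS_DECLINED = 'declined'
STATUS_EXPIRED = 'expired'

# Pattern: a state machine library that names states and transitions explicitly
from transitions import Machine

class Invitation:
    states = [STATUS_PENDING, STATUS_ACCEPTED, STATUS_DECLINED, STATUS_EXPIRED]

    def __init__(self):
        self.machine = Machine(model=self, states=Invitation.states, initial=STATUS_PENDING)
        self.machine.add_transition('accept', STATUS_PENDING, STATUS_ACCEPTED)
        self.machine.add_transition('decline', STATUS_PENDING, STATUS_DECLINED)
        self.machine.add_transition('expire', STATUS_PENDING, STATUS_EXPIRED)
```
Either way, the distilled entity line is the same: `status: pending | accepted | declined | expired`. The `add_transition` calls are also a head start on Step 3.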
### Step 3: Extract transitions
Find where status changes happen:
```python
def accept_invitation(invitation_id: int, slot_id: int):
    invitation = get_invitation(invitation_id)
    if invitation.status != 'pending':
        raise InvalidStateError()
    if invitation.expires_at < datetime.utcnow():
        raise ExpiredError()
    slot = get_slot(slot_id)
    if slot not in invitation.slots:
        raise InvalidSlotError()
    invitation.status = 'accepted'
    slot.status = 'booked'
    # Release other slots
    for other_slot in invitation.slots:
        if other_slot.id != slot_id:
            other_slot.status = 'available'
    # Create the interview
    interview = Interview(
        candidate_id=invitation.candidate_id,
        slot_id=slot_id,
        status='scheduled'
    )
    notify_interviewers(interview)
    send_confirmation_email(invitation.candidate, interview)
```
Extract:
```
rule CandidateAcceptsInvitation {
  when: CandidateAccepts(invitation, slot)
  requires: invitation.status = pending
  requires: invitation.expires_at > now
  requires: slot in invitation.slots
  ensures: invitation.status = accepted
  ensures: slot.status = booked
  ensures:
    for s in invitation.slots:
      if s != slot: s.status = available
  ensures: Interview.created(
    candidacy: invitation.candidacy,
    slot: slot,
    status: scheduled
  )
  ensures: Notification.created(to: slot.interviewers, ...)
  ensures: Email.created(to: invitation.candidate.email, ...)
}
```
**Key extraction patterns:**
| Code pattern | Spec pattern |
|--------------|--------------|
| `if x.status != 'pending': raise` | `requires: x.status = pending` |
| `if x.expires_at < now: raise` | `requires: x.expires_at > now` |
| `if item not in collection: raise` | `requires: item in collection` |
| `x.status = 'accepted'` | `ensures: x.status = accepted` |
| `Model.create(...)` | `ensures: Model.created(...)` |
| `send_email(...)` | `ensures: Email.created(...)` |
| `notify(...)` | `ensures: Notification.created(...)` |
Assertions, checks and validations found in code (e.g. `assert balance >= 0`, class-level validators) may map to expression-bearing invariants rather than rule preconditions. Consider whether they describe a system-wide property or a rule-specific guard.
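For example, a recurring check like the one sketched below (hypothetical `Account` model, names illustrative) describes a property of every account at all times, so it is better captured once as a top-level invariant than repeated as a `requires` on each rule that touches the balance:
```python
class Account:
    def __init__(self, balance: int = 0):
        assert balance >= 0, "balance never goes negative"
        self.balance = balance

    def withdraw(self, amount: int) -> None:
        # The same check reappears on every mutating path: a system-wide
        # property, not a precondition of one particular operation.
        assert self.balance - amount >= 0, "balance never goes negative"
        self.balance -= amount
```
In the spec this would surface once, as something like `invariant NonNegativeBalance { ... }`, with individual rules keeping only the guards that are genuinely rule-specific.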
### Step 4: Find temporal triggers
Look for scheduled jobs and time-based logic:
```python
# In celery tasks or cron jobs
@app.task
def expire_invitations():
    expired = Invitation.query.filter(
        Invitation.status == 'pending',
        Invitation.expires_at < datetime.utcnow()
    ).all()
    for invitation in expired:
        invitation.status = 'expired'
        for slot in invitation.slots:
            slot.status = 'available'
        notify_candidate_expired(invitation)

@app.task
def send_reminders():
    upcoming = Interview.query.filter(
        Interview.status == 'scheduled',
        Interview.slot.time.between(
            datetime.utcnow() + timedelta(hours=1),
            datetime.utcnow() + timedelta(hours=2)
        )
    ).all()
    for interview in upcoming:
        send_reminder_notification(interview)
```
Extract:
```
rule InvitationExpires {
  when: invitation: Invitation.expires_at <= now
  requires: invitation.status = pending
  ensures: invitation.status = expired
  ensures:
    for s in invitation.slots:
      s.status = available
  ensures: CandidateInformed(candidate: invitation.candidate, about: invitation_expired)
}

rule InterviewReminder {
  when: interview: Interview.slot.time - 1.hour <= now
  requires: interview.status = scheduled
  ensures: Notification.created(to: interview.interviewers, template: reminder)
}
```
### Step 5: Identify external boundaries
Look for third-party API calls, webhook handlers, import/export functions, and data that is read but never written (or vice versa).
These often indicate external entities:
```python
# Candidate data comes from Greenhouse, we don't create it
def import_from_greenhouse(webhook_data):
    candidate = Candidate.query.filter_by(
        greenhouse_id=webhook_data['id']
    ).first()
    if not candidate:
        candidate = Candidate(greenhouse_id=webhook_data['id'])
    candidate.name = webhook_data['name']
    candidate.email = webhook_data['email']
```
Suggests:
```
external entity Candidate {
  name: String
  email: String
}
```
Repeated interface patterns across service boundaries (e.g. the same serialisation contract expected by multiple consumers) suggest `contract` declarations for reuse rather than duplicated inline obligation blocks.
### Step 6: Abstract away implementation
Now make a pass through your extracted spec and remove implementation details.
**Before (too concrete):**
```
entity Invitation {
  candidate_id: Integer
  token: String(32)
  created_at: DateTime
  expires_at: DateTime
  status: pending | accepted | declined | expired
}
```
**After (domain-level):**
```
entity Invitation {
  candidacy: Candidacy
  created_at: Timestamp
  expires_at: Timestamp
  status: pending | accepted | declined | expired
  is_expired: expires_at <= now
}
```
Changes:
- `candidate_id: Integer` became `candidacy: Candidacy` (relationship, not FK)
- `token: String(32)` removed (implementation)
- `DateTime` became `Timestamp` (domain type)
- Added derived `is_expired` for clarity
Config values that derive from other config values (e.g. `extended_timeout = base_timeout * 2`) should use qualified references or expression-form defaults in the config block rather than independent literal values.
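A sketch of what that looks like in the config block, assuming the expression-form defaults and qualified `config.` references described above (parameter names are illustrative):
```
config {
  base_timeout: Duration = 1.hour
  extended_timeout: Duration = config.base_timeout * 2
}
```
If `extended_timeout` were written as an independent literal (`2.hours`), the derivation found in the code would be lost and the two values could drift apart.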
### Step 7: Validate with stakeholders
The extracted spec is a hypothesis. Validate it:
1. **Show the spec to the original developers.** "Is this what the system does?"
2. **Show to stakeholders.** "Is this what the system should do?"
3. **Look for gaps.** Code often has bugs or missing features; the spec might reveal them.
Common findings:
- "Oh, that retry logic was a hack, we should remove it"
- "Actually we wanted X but never built it"
- "These two code paths should be the same but aren't"
## Recognising library spec candidates
During distillation, stay alert for code that implements **generic integration patterns** rather than application-specific logic. These belong in library specs, not your main specification.
The same principle applies in elicitation. When a stakeholder describes "we use Google for login" or "payments go through Stripe", pause and consider whether this is a library spec.
### Signals in the code
**Third-party integration modules:**
```python
# Finding code like this suggests a library spec
class StripeWebhookHandler:
    def handle_invoice_paid(self, event):
        ...

    def handle_subscription_cancelled(self, event):
        ...

class GoogleOAuthProvider:
    def exchange_code(self, code):
        ...

    def refresh_token(self, refresh_token):
        ...
```
**Generic patterns with specific providers:**
- OAuth flows (Google, Microsoft, GitHub)
- Payment processing (Stripe, PayPal)
- Email delivery (SendGrid, Postmark, SES)
- Calendar sync (Google Calendar, Outlook)
- ATS integrations (Greenhouse, Lever)
- File storage (S3, GCS)
**Configuration-driven integrations:**
```python
# Heavy configuration suggests the integration itself is separable
OAUTH_CONFIG = {
    'google': {'client_id': ..., 'scopes': ...},
    'microsoft': {'client_id': ..., 'scopes': ...},
}
```
### Questions to ask
1. **"Is this integration logic, or application logic?"**
Integration: how to talk to Stripe.
Application: what to do when payment succeeds.
2. **"Would another application integrate the same way?"**
If yes, library spec candidate. If no, probably application-specific.
3. **"Does the code separate integration from application concerns?"**
If cleanly separated, easy to extract to library spec. If tangled, might need refactoring first (but the spec should still separate them).
### How to handle
**Option 1: Reference an existing library spec**
If a standard library spec exists for this integration:
```
use "github.com/allium-specs/stripe-billing/abc123" as stripe
-- Application responds to Stripe events
rule ActivateSubscription {
when: stripe/PaymentSucceeded(invoice)
...
}
```
**Option 2: Create a separate library spec**
If no standard spec exists but the integration is generic:
```
-- greenhouse-ats.allium (library spec)
-- Specifies: Greenhouse webhook events, candidate sync, etc.

-- interview-scheduling.allium (application spec)
use "./greenhouse-ats.allium" as greenhouse

rule ImportCandidate {
  when: greenhouse/CandidateCreated(data)
  ensures: Candidacy.created(...)
}
```
**Option 3: Abstract and move on**
If the integration is minor, just abstract it:
```
-- Don't specify Slack details, just:
ensures: Notification.created(
  to: interviewers,
  channel: slack
)
```
### Red flags: integration logic in your spec
If you find yourself writing spec like this, stop and reconsider:
```
-- TOO DETAILED - this is Stripe's domain, not yours
rule ProcessStripeWebhook {
  when: WebhookReceived(payload, signature)
  requires: verify_stripe_signature(payload, signature)
  let event = parse_stripe_event(payload)
  if event.type = "invoice.paid":
    ...
}
```
Instead:
```
-- Application responds to payment events (integration handled elsewhere)
rule PaymentReceived {
  when: stripe/InvoicePaid(invoice)
  ...
}
```
### Common library spec extractions
| Code pattern found | Library spec candidate |
|-------------------|----------------------|
| OAuth token exchange, refresh, session management | `oauth2.allium` |
| Stripe webhook handling, subscription lifecycle | `stripe-billing.allium` |
| Email sending with templates, bounce handling | `email-delivery.allium` |
| Calendar event sync, availability checking | `calendar-integration.allium` |
| ATS candidate import, status sync | `greenhouse-ats.allium`, `lever-ats.allium` |
| File upload, virus scanning, thumbnail generation | `file-storage.allium` |
See patterns.md Pattern 8 for detailed examples of integrating library specs.
## Common distillation challenges
### Challenge: Duplicate terminology
When you find two terms for the same concept (across specs, within a spec, or between spec and code), treat it as a blocking problem.
```
-- BAD: Acknowledges duplication without resolving it
-- Order vs Purchase
-- checkout.allium uses "Purchase" - these are equivalent concepts.
```
This is not a resolution. When different parts of a codebase are built against different specs, both terms end up in the implementation: duplicate models, redundant join tables, foreign keys pointing both ways.
**What to do:**
- Choose one term. Cross-reference related specs before deciding.
- Update all references. Do not leave the old term in comments or "see also" notes.
- Note the rename in a changelog, not in the spec itself.
**Warning signs in code:**
- Two models representing the same concept (`Order` and `Purchase`)
- Join tables for both (`order_items`, `purchase_items`)
- Comments like "equivalent to X" or "same as Y"
The spec you extract must pick one term. Flag the other as technical debt to remove.
### Challenge: Implicit state machines
Code often has implicit states that are not modelled:
```python
# No explicit status field, but there's a state machine hiding here
class FeedbackRequest:
    interview_id = Column(Integer)
    interviewer_id = Column(Integer)
    requested_at = Column(DateTime)
    reminded_at = Column(DateTime, nullable=True)
    feedback_id = Column(Integer, nullable=True)  # FK to Feedback if submitted
```
The implicit states are:
- `pending`: requested_at set, feedback_id null, reminded_at null
- `reminded`: reminded_at set, feedback_id null
- `submitted`: feedback_id set
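One way to check this reading is to write the derivation down explicitly (illustrative sketch):
```python
def feedback_request_status(request) -> str:
    # Derive the hidden state from which nullable columns are populated
    if request.feedback_id is not None:
        return 'submitted'
    if request.reminded_at is not None:
        return 'reminded'
    return 'pending'
```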
Extract to explicit:
```
entity FeedbackRequest {
  interview: Interview
  interviewer: Interviewer
  requested_at: Timestamp
  reminded_at: Timestamp?
  status: pending | reminded | submitted
}
```
### Challenge: Scattered logic
The same conceptual rule might be spread across multiple places:
```python
# In API handler
def accept_invitation(request):
    if invitation.status != 'pending':
        return error(400, "Already responded")
    ...

# In model
class Invitation:
    def can_accept(self):
        return self.expires_at > datetime.utcnow()

# In service
def process_acceptance(invitation, slot):
    if slot not in invitation.slots:
        raise InvalidSlot()
    ...
```
Consolidate into one rule:
```
rule CandidateAccepts {
  when: CandidateAccepts(invitation, slot)
  requires: invitation.status = pending
  requires: invitation.expires_at > now
  requires: slot in invitation.slots
  ...
}
```
### Challenge: Dead code and historical accidents
Codebases accumulate features that were built but never used, workarounds for bugs that are now fixed, and code paths that are never executed.
Do not include these in the spec. If you are unsure:
1. Check if the code is actually reachable
2. Ask developers if it is intentional
3. Check git history for context
### Challenge: Missing error handling
Code might silently fail or have incomplete error handling:
```python
def send_notification(user, message):
    try:
        slack.send(user.slack_id, message)
    except SlackError:
        pass  # Silently ignore failures
```
The spec should capture the intended behaviour, not the bug:
```
ensures: Notification.created(to: user, channel: slack)
```
Whether the current implementation properly handles failures is separate from what the system should do.
### Challenge: Over-engineered abstractions
Enterprise codebases often have abstraction layers that obscure intent:
```java
public interface NotificationStrategy {
    void notify(NotificationContext context);
}

public class SlackNotificationStrategy implements NotificationStrategy {
    @Override
    public void notify(NotificationContext context) {
        // Actual Slack call buried 5 levels deep
    }
}
```
Cut through to the actual behaviour. The spec does not need strategy patterns, dependency injection or abstract factories. Just: `ensures: Notification.created(channel: slack, ...)`
## Checklist: Have you abstracted enough?
Before finalising a distilled spec:
- [ ] No database column types (Integer, VARCHAR, etc.)
- [ ] No ORM or query syntax
- [ ] No HTTP status codes or API paths
- [ ] No framework-specific concepts (middleware, decorators, etc.)
- [ ] No programming language types (int, str, List, etc.)
- [ ] No variable names from the code (use domain terms)
- [ ] No infrastructure (Redis, Kafka, S3, etc.)
- [ ] Foreign keys replaced with relationships
- [ ] Tokens/secrets removed (implementation of identity)
- [ ] Timestamps use domain Duration, not timedelta/seconds
If any remain, ask: "Would a stakeholder include this in a requirements doc?"
## Checklist: Terminology consistency
- [ ] Each concept has exactly one name throughout the spec
- [ ] No "also known as" or "equivalent to" comments
- [ ] Cross-referenced related specs for conflicting terms
- [ ] Duplicate models in code flagged as technical debt to remove
## After distillation
The extracted spec is a starting point. For targeted changes as requirements evolve, use /skill:tend. For checking ongoing alignment between the spec and implementation, use /skill:weed.
## References
- [Language reference](references/language-reference.md), full Allium syntax
- [Worked examples](references/worked-examples.md), complete code-to-spec examples in Python, TypeScript and Java

@@ -0,0 +1 @@
../../../../allium-main/references/language-reference.md

@@ -0,0 +1 @@
../../../../allium-main/skills/distill/references/worked-examples.md

348  .pi/skills/elicit/SKILL.md  Normal file

@@ -0,0 +1,348 @@
---
name: elicit
description: "Run a structured discovery session to build an Allium specification through conversation. Use when the user wants to create a new spec from scratch, elicit or gather requirements, capture domain behaviour, specify a feature or system, define what a system should do, or is describing functionality and needs help shaping it into a specification."
disable-model-invocation: true
license: MIT
metadata:
  upstream: https://github.com/juxt/allium
  version: 3
---
# Elicitation
This skill guides you through building Allium specifications by conversation. The goal is to surface ambiguities and produce a specification that captures what the software does without prescribing implementation.
The same principles apply to distillation. Whether you are hearing a stakeholder describe a feature or reading code that implements it, the challenge is identical: finding the right level of abstraction.
## Scoping the specification
Before diving into details, establish what you are specifying. Not everything needs to be in one spec.
### Questions to ask first
**"What's the boundary of this specification?"** A complete system? A single feature area? One service in a larger system? Be explicit about what is in and out of scope.
**"Are there areas we should deliberately exclude?"** Third-party integrations might be library specs. Legacy features might not be worth specifying. Some features might belong in separate specs.
**"Is this a new system or does code already exist?"** If code exists, you are doing distillation with elicitation. Existing code constrains what is realistic to specify.
### Documenting scope decisions
Capture scope at the start of every spec:
```
-- allium: 3
-- interview-scheduling.allium
-- Scope: Interview scheduling for the hiring pipeline
-- Includes: Candidacy, Interview, Slot management, Invitations, Feedback
-- Excludes:
-- - Authentication (use oauth library spec)
-- - Payments (not applicable)
-- - Reporting dashboards (separate spec)
-- Dependencies: User entity defined in core.allium
```
The version marker (`-- allium: N`) must be the first line of every `.allium` file. Use the current language version number.
## Finding the right level of abstraction
The hardest part of specification is choosing what to include and what to leave out. Too concrete and you are specifying implementation. Too abstract and you are not saying anything useful.
### The "Why" test
For every detail, ask: "Why does the stakeholder care about this?"
| Detail | Why? | Include? |
|--------|------|----------|
| "Users log in with Google OAuth" | They need to authenticate | Maybe not, "Users authenticate" might be sufficient |
| "We support Google and Microsoft OAuth" | Users choose their provider | Yes, the choice is domain-level |
| "Sessions expire after 24 hours" | Security/UX decision | Yes, affects user experience |
| "Sessions are stored in Redis" | Performance | No, implementation detail |
| "Passwords must be 12+ characters" | Security policy | Yes, affects users |
| "Passwords are hashed with bcrypt" | Security implementation | No, how not what |
### The "Could it be different?" test
Ask: "Could this be implemented differently while still being the same system?"
- If yes, it is probably an implementation detail. Abstract it away.
- If no, it is probably domain-level. Include it.
Examples:
- "Notifications sent via Slack". Could be email, SMS, etc. Abstract to `Notification.created(channel: ...)`.
- "Interviewers must confirm within 3 hours". This specific deadline matters at the domain level. Include the duration.
- "We use PostgreSQL". Could be any database. Do not include.
- "Data is retained for 7 years for compliance". Regulatory requirement. Include.
### The "Template vs Instance" test
Is this a category of thing, or a specific instance?
| Instance (implementation) | Template (domain-level) |
|---------------------------|-------------------------|
| Google OAuth | Authentication provider |
| Slack | Notification channel |
| 15 minutes | Link expiry duration (configurable) |
| Greenhouse ATS | External candidate source |
Sometimes the instance IS the domain concern. "We specifically integrate with Salesforce" might be a competitive feature. "We support exactly these three OAuth providers" might be design scope.
When in doubt, ask the stakeholder: "If we changed this, would it be a different system or just a different implementation?"
### Levels of abstraction
```
Too abstract: "Users can do things"
|
Product level: "Candidates can accept or decline interview invitations"
|
Too concrete: "Candidates click a button that POSTs to /api/invitations/:id/accept"
```
**Signs you are too abstract.** The spec could describe almost any system. No testable assertions. Product owner says "but that doesn't capture..."
**Signs you are too concrete.** You are mentioning technologies, frameworks or APIs. You are describing UI elements (buttons, pages, forms). The implementation team says "why are you dictating how we build this?"
### Configuration vs hardcoding
When you encounter a specific value (3 hours, 7 days, etc.), ask:
1. **Is this value a design decision?** Include it.
2. **Might it vary per deployment or customer?** Make it configurable.
3. **Is it arbitrary?** Consider whether to include it at all.
```
-- Hardcoded design decision
rule InvitationExpires {
  when: invitation: Invitation.created_at + 7.days <= now
  ...
}

-- Configurable
config {
  invitation_expiry: Duration = 7.days
}

rule InvitationExpires {
  when: invitation: Invitation.created_at + config.invitation_expiry <= now
  ...
}
```
### Black boxes
Some logic is important but belongs at a different level:
```
-- Black box: we know it exists and what it considers, but not how
ensures: Suggestion.created(
  interviewers: InterviewerMatching.suggest(
    considering: {
      role.required_skills,
      Interviewer.skills,
      Interviewer.availability,
      Interviewer.recent_load
    }
  )
)
```
The spec says there is a matching algorithm, that it considers these inputs and that it produces interviewer suggestions. The spec does not say how matching works, what weights are used or the specific algorithm.
This is the right level when the algorithm is complex and evolving, when product owners care about inputs and outputs rather than internals, and when a separate detailed spec could cover it if needed.
## Elicitation methodology
### Phase 1: Scope definition
**Goal:** Understand what we are specifying and where the boundaries are.
Questions to ask:
1. "What is this system fundamentally about? In one sentence?"
2. "Where does this system start and end? What's in scope vs out?"
3. "Who are the users? Are there different roles?"
4. "What are the main things being managed, the nouns?"
5. "Are there existing systems this integrates with? What do they handle?"
**Outputs:** List of actors and roles. List of core entities. Boundary decisions (what is external). One-sentence description.
**Watch for:** Scope creep ("and it also does X, Y, Z", gently refocus). Assumed knowledge ("obviously it handles auth", make explicit).
### Phase 2: Happy path flow
**Goal:** Trace the main journey from start to finish.
Questions to ask:
1. "Walk me through a typical [X] from start to finish"
2. "What happens first? Then what?"
3. "What triggers this? A user action? Time passing? Something else?"
4. "What changes when that happens? What state is different?"
5. "Who needs to know when this happens? How?"
**Technique:** Follow one entity through its lifecycle.
```
Candidacy:
pending_scheduling -> scheduling_in_progress -> scheduled ->
interview_complete -> feedback_collected -> decided
```
**Outputs:** State machines for key entities. Main triggers and their outcomes. Communication touchpoints.
**Watch for:** Jumping to edge cases too early ("but what if...", note it and stay on happy path). Implementation details creeping in ("the API endpoint...", redirect to outcomes).
### Phase 3: Edge cases and errors
**Goal:** Discover what can go wrong and how the system handles it.
Questions to ask:
1. "What if [actor] doesn't respond?"
2. "What if [condition] isn't met when they try?"
3. "What if this happens twice? Or in the wrong order?"
4. "How long should we wait before [action]?"
5. "When should a human be alerted to intervene?"
6. "What if [external system] is unavailable?"
**Technique:** For each rule, ask "what are all the ways requires could fail?"
**Outputs:** Timeout and deadline rules. Retry and escalation logic. Error states. Recovery paths.
**Watch for:** Infinite loops ("then it retries, then retries again...", need terminal states). Missing escalation, because eventually a human needs to know.
When stakeholders state system-wide properties ("balance never goes negative", "no two interviews overlap for the same candidate"), these are candidates for top-level invariants. Capture them as `invariant Name { expression }` declarations.
### Phase 4: Refinement
**Goal:** Clean up the specification and identify gaps.
Questions to ask:
1. "Looking at [entity], are these states complete? Can it be in any other state?"
2. "Is there anything we haven't covered?"
3. "This rule references [X], do we need to define that, or is it external?"
4. "Is this detail essential here, or should it live in a detailed spec?"
**Technique:** Read back the spec and ask "does this match your mental model?"
**Outputs:** Complete entity definitions. Open questions documented. Deferred specifications identified. External boundaries confirmed.
When the same obligation pattern (e.g. a serialisation contract, a deterministic evaluation requirement) appears across multiple surfaces, suggest extracting it as a `contract` declaration for reuse.
## Elicitation principles
### Ask one question at a time
Bad: "What entities do you have, and what states can they be in, and who can modify them?"
Good: "What are the main things this system manages?"
Then: "Let's take [Candidacy]. What states can it be in?"
Then: "Who can change a candidacy's state?"
### Work through implications
When a choice arises, do not just accept the first answer. Explore consequences.
"You said invitations expire after 48 hours. What happens then?"
"And if the candidate still hasn't responded after we retry?"
"What if they never respond, is this candidacy stuck forever?"
This surfaces decisions they have not made yet.
### Distinguish product from implementation
When you hear implementation language, redirect:
| They say | You redirect |
|----------|-------------|
| "The API returns a 404" | "So the user is informed it's not found?" |
| "We store it in Postgres" | "What information is captured?" |
| "The frontend shows a modal" | "The user is prompted to confirm?" |
| "We use a cron job" | "This happens on a schedule, how often?" |
### Surface ambiguity explicitly
Better to record an open question than assume.
"I'm not sure whether declining should return the candidate to the pool or remove them entirely. Let me note that as an open question."
```
open question "When candidate declines, do they return to pool or exit?"
```
### Use concrete examples
Abstract discussions get stuck. Ground them.
"Let's say Alice is a candidate for the Senior Engineer role. She's been sent an invitation with three slots. Walk me through what happens when she clicks on Tuesday 2pm."
### Iterate willingly
It is normal to revise earlier decisions.
"Earlier we said all admins see all notifications. But now you're describing role-specific dashboards. Should we revisit that?"
### Know when to stop
Not everything needs to be specified now.
"This is getting into how the matching algorithm works. Should we defer that to a detailed spec?"
"We've covered the main flow. The reporting dashboard sounds like a separate specification."
## Common elicitation traps
### The "Obviously" trap
When someone says "obviously" or "of course", probe. "You said obviously the admin approves. Is there ever a case where they don't need to? Could this be automated later?"
### The "Edge Case Spiral" trap
Some people want to cover every edge case immediately. "Let's capture that as an open question and stay on the main flow for now. We'll come back to edge cases."
### The "Technical Solution" trap
Engineers especially jump to solutions. "I hear you saying we need real-time updates. At the domain level, what does the user need to see and when?"
### The "Vague Agreement" trap
Do not accept "yes" without specifics. "You said yes, candidates can reschedule. How many times? Is there a limit? What happens after that?"
### The "Missing Actor" trap
Watch for actions without clear actors. "You said 'the slots are released'. Who or what releases them? Is it automatic, or does someone trigger it?"
### The "Equivalent Terms" trap
When you hear two terms for the same concept, from different stakeholders, existing code or related specs, stop and resolve it before continuing.
"You said 'Purchase' but earlier we called this an 'Order'. Which term should we use?"
A comment noting that two terms are equivalent is not a resolution. It guarantees both will appear in the implementation. Pick one term, cross-reference related specs and update all references. Do not leave the old term anywhere, not even in "see also" notes.
## Elicitation session structure
**Opening (5 min).** Explain Allium briefly: "We're capturing what the software does, not how it's built." Set expectations: "I'll ask lots of questions, some obvious-seeming." Agree on scope for this session.
**Scope definition (10-15 min).** Identify actors, entities, boundaries. Get the one-sentence description.
**Happy path (20-30 min).** Trace main flow start to finish. Capture states, triggers, outcomes. Note communications.
**Edge cases (15-20 min).** Timeouts and deadlines. Failure modes. Escalation paths.
**Wrap-up (5-10 min).** Read back key decisions. List open questions. Identify next session scope if needed.
**After session.** Write up specification draft. Send for review. Note questions for next session.
## After elicitation
For targeted changes where you already know what you want, use /skill:tend. For substantial additions that need structured discovery (new feature areas, complex entity relationships, unclear requirements), elicit is still the right tool even if a spec already exists. Checking alignment between specs and implementation belongs to /skill:weed.
## References
- [Language reference](references/language-reference.md), full Allium syntax
- [Recognising library spec opportunities](references/library-spec-signals.md), signals, questions and decision framework for identifying library specs during elicitation

@@ -0,0 +1 @@
../../../../allium-main/references/language-reference.md

@@ -0,0 +1 @@
../../../../allium-main/skills/elicit/references/library-spec-signals.md

214  .pi/skills/propagate/SKILL.md  Normal file

@@ -0,0 +1,214 @@
---
name: propagate
description: "Generate tests from Allium specifications. Use when the user wants to propagate tests, generate test files from a spec, write tests for a specification, create property-based tests, produce state machine tests, check test coverage against spec obligations, or understand what tests a specification requires."
disable-model-invocation: true
license: MIT
metadata:
  upstream: https://github.com/juxt/allium
  version: 3
---
# Propagation
This skill generates tests from Allium specifications. Propagation is how plants reproduce from cuttings of the parent: the spec is the parent, the tests are the offspring.
Deterministic tools guarantee completeness (every spec construct maps to a test obligation). You handle the implementation bridge: correlating spec constructs with code, generating tests in the project's conventions.
## Prerequisites
Before propagating tests, you need:
1. **An Allium spec** — the `.allium` file describing the system's behaviour
2. **A target codebase** — the implementation to test
3. **Test obligations** — from `allium plan <spec>` (JSON listing every required test)
4. **Domain model** — from `allium model <spec>` (JSON describing entity shapes, constraints, state machines)
If the CLI tools are not available, derive test obligations manually from the spec using the test-generation taxonomy in `references/test-generation.md`.
## Modes
### Surface mode
Generates boundary tests from surface declarations. Use when the user wants to test an API, UI contract or integration boundary.
For each surface in the spec:
1. **Exposure tests** — verify each item in `exposes` is accessible to the specified actor, including `for` iteration over collections
2. **Provides tests** — verify operations appear when their `when` conditions are true and are hidden otherwise, including when the corresponding rule's `requires` clauses are not met
3. **Actor restriction tests** — verify the surface is not accessible to other actor types
4. **Actor identification tests** — verify only entities matching the actor's `identified_by` predicate can interact; for actors with `within`, verify interaction is scoped to the declared context
5. **Context scoping tests** — verify the surface instance is absent when no entity matches the `context` predicate
6. **Contract obligation tests** — verify `demands` are satisfied by the counterpart, `fulfils` are supplied by this surface, including all typed signatures
7. **Guarantee tests** — verify `@guarantee` annotations hold across the boundary
8. **Timeout tests** — verify referenced temporal rules fire within the surface's context
9. **Related navigation tests** — verify navigation to related surfaces resolves to the correct context entity
### Spec mode
Walks the full test obligations document. Use when the user wants comprehensive test coverage for the entire specification.
Categories from the test-generation taxonomy:
- **Entity and value type tests** — fields, types, optional (`?`) null handling, `when`-clause state-dependent presence, relationships, join lookups, equality
- **Enum tests** — comparability across named enums, membership tests, inline enum isolation
- **Sum type tests** — variant fields, type guards, exhaustiveness, creation via variant name, base `.created` trigger narrowing
- **Derived value and projection tests** — computation, filtering, `-> field` extraction, parameterised derived values, `now` volatility, collection operations
- **Default instance tests** — unconditional existence, field values, cross-references between defaults
- **Config tests** — defaults, overrides, mandatory parameters, expression-form defaults, qualified references, config chains
- **Invariant tests** — post-rule verification, edge cases, implication logic, entity-level invariants
- **Rule tests** — success/failure/edge cases, conditionals (ensuring `if` guards read resulting state), entity creation, removal, bulk updates, rule-level `for` iteration, `let` bindings, chained triggers
- **State transition tests** — valid/invalid transitions, terminal states, `transitions_to` vs `becomes` semantics
- **Temporal tests** — deadline boundaries, re-firing prevention, optional field null behaviour
- **Surface tests** — exposure, availability, actor identification with `within` scoping, context scoping, related navigation
- **Contract tests** — signature satisfaction, `@invariant` honouring, `demands`/`fulfils` direction
- **Cross-module tests** — qualified entity references, external trigger responses, type placeholder substitution
- **Cross-rule interaction tests** — duplicate creation guards, provides availability
- **Transition graph tests** — every declared edge is reachable via its witnessing rule, undeclared transitions are rejected, terminal states have no outbound rules, non-terminal states have at least one exit, exact correspondence between enum values and graph edges
- **State-dependent field tests** — presence when in qualifying state, absence when outside, presence obligations on entering the `when` set, absence obligations on leaving, no obligation when moving within or outside, convergent transitions all set the field, guard required to access `when`-qualified fields, derived value `when` inference via input intersection
- **Scenario tests** — happy path, edge cases, order independence
## Test output kinds
### 1. Assertion-based tests
For deterministic obligations: field presence, enum membership, transition validity, surface exposure, state-dependent field presence and absence. These are standard unit/integration tests.
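A sketch of two such obligations as plain pytest tests. The tiny in-memory model stands in for the real implementation so the example is self-contained; in a real project the test would call the code located via the implementation bridge below:
```python
import pytest

# Minimal stand-ins so the sketch runs on its own.
class InvalidStateError(Exception):
    pass

class Invitation:
    def __init__(self, status="pending"):
        self.status = status

def accept_invitation(invitation):
    if invitation.status != "pending":
        raise InvalidStateError()
    invitation.status = "accepted"

def test_accepting_a_pending_invitation_succeeds():
    # Obligation: rule success case
    invitation = Invitation(status="pending")
    accept_invitation(invitation)
    assert invitation.status == "accepted"

def test_accepting_a_non_pending_invitation_is_rejected():
    # Obligation: rule failure case — requires: invitation.status = pending
    invitation = Invitation(status="expired")
    with pytest.raises(InvalidStateError):
        accept_invitation(invitation)
```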
### 2. Property-based tests
For invariants and rule properties. Each expression-bearing invariant becomes a PBT property:
- Generate a valid entity state using the generator spec
- Apply a sequence of rules (following the transition graph when declared, or deriving valid sequences from rules alone)
- Check the invariant holds at every step
Use the project's PBT framework:
| Language | Framework | Discovery |
|----------|-----------|-----------|
| TypeScript | fast-check | `package.json` |
| Python | Hypothesis | `pyproject.toml` |
| Rust | proptest | `Cargo.toml` |
| Go | rapid | `go.mod` |
| Elixir | StreamData | `mix.exs` |
Fall back to assertion-based tests if no PBT framework is present.
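A sketch of the shape of such a property using Hypothesis. The toy entity, actions and invariant ("balance never goes negative") are illustrative stand-ins for the real implementation; the structure — generate a valid state, apply a generated sequence of rule applications, check the invariant after each step — is the part that carries over:
```python
from hypothesis import given, strategies as st

class Account:
    """Toy stand-in for the implementation under test."""
    def __init__(self, balance):
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount

    def withdraw(self, amount):
        if amount > self.balance:
            return  # requires-clause not met: no state change
        self.balance -= amount

ACTIONS = {"deposit": Account.deposit, "withdraw": Account.withdraw}

@given(
    initial=st.integers(min_value=0, max_value=1_000),
    steps=st.lists(
        st.tuples(st.sampled_from(sorted(ACTIONS)), st.integers(min_value=0, max_value=500)),
        max_size=50,
    ),
)
def test_balance_never_goes_negative(initial, steps):
    account = Account(initial)
    for name, amount in steps:
        ACTIONS[name](account, amount)
        assert account.balance >= 0  # invariant checked after every rule application
```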
### 3. State machine tests
For entities with status enums. When a transition graph is declared, walk every path through the graph. When no graph is declared, derive valid transitions from rules.
- Verify transitions succeed via witnessing rules
- Verify rejected transitions fail
- Verify state-dependent fields are present or absent at each state per their `when` clauses
- Verify invariants hold at each state
State machine tests require an **action map**: a function per transition edge that takes the entity in the source state and produces it in the target state by calling the actual implementation code. Without this map, the test framework can describe valid paths through the graph but cannot execute them.
To build the action map:
1. For each edge in the transition graph, find the witnessing rule in the spec
2. Find the code implementing that rule (the implementation bridge)
3. Write a test action that sets up the preconditions (`requires` clauses), invokes the code, and returns the entity in the target state
4. Register the action under the `(from_state, to_state)` key
Once the map is built, the PBT framework can walk random valid paths: start at any non-terminal state, pick a random outbound edge, apply its action, check all entity-level invariants, repeat. The path length and starting state are generated randomly. This is the fullest expression of the spec's transition graph as a test.
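A sketch of the action-map idea, with toy transition functions standing in for calls into the real implementation (entity, states and edge set are illustrative):
```python
import random

class Invitation:
    """Toy entity; a real action map drives the actual implementation."""
    def __init__(self):
        self.status = "pending"

def accept(inv):
    inv.status = "accepted"
    return inv

def decline(inv):
    inv.status = "declined"
    return inv

def expire(inv):
    inv.status = "expired"
    return inv

# One entry per declared transition edge: (from_state, to_state) -> action.
ACTION_MAP = {
    ("pending", "accepted"): accept,
    ("pending", "declined"): decline,
    ("pending", "expired"): expire,
}

def walk_random_path(max_steps=10):
    """Walk one random valid path through the graph, checking each step."""
    entity = Invitation()
    for _ in range(max_steps):
        edges = [(frm, to) for (frm, to) in ACTION_MAP if frm == entity.status]
        if not edges:  # terminal state: no outbound edges
            break
        frm, to = random.choice(edges)
        entity = ACTION_MAP[(frm, to)](entity)
        assert entity.status == to
        # In a full test, check all entity-level invariants here as well.
```
A PBT framework replaces `random.choice` with generated choices so that failing paths shrink to a minimal reproducing sequence.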
## The implementation bridge
You correlate spec constructs with implementation code, the same way the weed skill correlates for divergence checking.
### For surface tests
Map surfaces to their implementation:
- API surfaces map to endpoints (REST routes, GraphQL resolvers, gRPC services)
- UI surfaces map to components or pages
- Integration surfaces map to message handlers or SDK methods
Discover the mapping by reading the codebase. Look for naming patterns, route definitions and handler registrations.
### For internal tests
For each rule in the spec:
1. Find the code implementing the rule (service method, event handler, state machine transition)
2. Determine how to instantiate the entities involved (factories, builders, fixtures)
3. Determine how to invoke the rule (API call, method call, event dispatch)
4. Determine how to assert postconditions (database queries, return values, event assertions)
### For temporal tests
Temporal triggers (deadline-based rules) need a controllable time source in the test. If the implementation uses wall-clock time (`Instant.now()`, `System.currentTimeMillis()`), the test cannot reliably position itself before, at or after a deadline.
Before attempting temporal tests, check whether the component accepts an injected clock or time parameter. Common patterns: a `Clock` parameter on the constructor, an epoch-millisecond argument on the method, a `TimeProvider` interface. If the seam exists, inject a controllable time source. If it does not, flag this as a test infrastructure gap: the temporal tests cannot be generated until the component supports time injection. Do not attempt to test temporal behaviour by sleeping or racing against wall-clock time.
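A sketch of the clock-injection seam, using a hypothetical expiry check; the test positions time explicitly on either side of the deadline instead of sleeping:
```python
from datetime import datetime, timedelta

class Invitation:
    def __init__(self, expires_at):
        self.expires_at = expires_at

# Production code accepts a clock function instead of calling utcnow() directly.
def is_expired(invitation, clock=datetime.utcnow):
    return clock() >= invitation.expires_at

def test_invitation_expires_exactly_at_the_deadline():
    deadline = datetime(2026, 1, 1, 12, 0, 0)
    invitation = Invitation(expires_at=deadline)

    assert not is_expired(invitation, clock=lambda: deadline - timedelta(seconds=1))
    assert is_expired(invitation, clock=lambda: deadline)
```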
### For cross-module trigger chains
When a rule emits a trigger that another spec's rule receives (e.g. the Arbiter emits `ClerkReceivesEvent`, the Clerk handles it), testing the chain requires multiple components wired together.
Before generating cross-module tests:
1. Trace the trigger emission graph from the plan output: which rules emit triggers, and which rules in other specs receive them
2. Check whether the codebase has an existing integration test fixture that wires the participating components (a pipeline test, an end-to-end test helper, a test harness class)
3. If a fixture exists, reuse it. Cross-module tests should compose existing wiring, not rebuild it
4. If no fixture exists, generate only the test skeleton with TODOs marking where component wiring is needed
Cross-module tests are integration tests by nature. They verify that the spec's trigger chains are faithfully implemented across component boundaries, but the setup cost is high. Prioritise them after single-component tests are passing.
### Reusing existing tests
When exploring the codebase, note which spec obligations are already covered by existing tests. An existing integration test that exercises the happy path from event submission through to acknowledged output already covers multiple `rule_success` obligations and the end-to-end scenario.
When an existing test covers a spec obligation, reference it rather than generating a duplicate. The propagate skill's value at the integration level is verifying that coverage is complete against the spec's obligation list, identifying gaps, and generating tests to fill them. Replacing working hand-written tests with generated equivalents adds no value.
### For deferred specs
Deferred specifications are fully specified in separate files. When the target codebase doesn't include the deferred spec's module, generate a test stub with a placeholder:
```typescript
// TODO: deferred spec — InterviewerMatching.suggest
// This behaviour is specified as deferred. Provide a mock or skip.
```
## Process
1. **Read the spec** — understand entities, rules, surfaces, invariants, transition graphs, state-dependent fields, contracts, config, defaults
2. **Read test obligations** — from `allium plan` output or manual derivation
3. **Read domain model** — from `allium model` output or manual derivation
4. **Explore the codebase** — find existing tests, test framework, entity implementations, rule implementations
5. **Map constructs to code** — correlate spec entities/rules/surfaces with implementation classes/functions/endpoints
6. **Generate tests** — produce test files following the project's conventions
7. **Verify tests compile/run** — ensure generated tests are syntactically valid
### Discovery checklist
Before generating tests, establish:
- [ ] Test framework and runner (Jest, pytest, cargo test, etc.)
- [ ] PBT framework if present (fast-check, Hypothesis, proptest, etc.)
- [ ] Test file location conventions (co-located, `__tests__/`, `tests/`, etc.)
- [ ] Entity/model location and patterns (classes, interfaces, structs)
- [ ] Factory/fixture patterns for test data
- [ ] How state transitions are implemented (methods, events, state machines)
- [ ] How surfaces are implemented (routes, controllers, resolvers)
- [ ] Existing test helpers or utilities
- [ ] Whether components accept injected time sources for temporal tests
- [ ] Whether an integration test fixture exists for cross-module trigger chains
- [ ] Which spec obligations are already covered by existing tests
### Generator awareness
When generator specs are available, use them to produce valid test data:
- Respect field types and constraints
- For entities with transition graphs, generate entities at specific lifecycle states with correct field presence per `when` clauses (e.g. a `shipped` Order has `tracking_number` and `shipped_at` populated; a `pending` Order does not)
- For invariants, generate states that exercise boundary conditions
- For config parameters, use declared defaults unless testing overrides
## Interaction with other tools
- /skill:distill produces specs from code. Those specs feed propagate.
- /skill:weed checks alignment. After propagating tests, weed verifies spec-code match.
- /skill:tend evolves specs. After spec changes, run propagate again to update tests.
- /skill:elicit builds specs through conversation. Once a spec is ready, propagate generates tests.
## Limitations
- Generated tests are a starting point. They may need adjustment for project-specific patterns.
- The implementation bridge is LLM-mediated. Complex or unusual codebases may need manual guidance on the mapping.
- Cross-module test generation is not yet supported. Each spec generates tests independently.
- Runtime trace validation and model checking are separate workstreams.

@@ -0,0 +1 @@
../../../../allium-main/references/test-generation.md

@@ -64,14 +64,14 @@ _Goal: get the root `/skill:allium` working in pi with a local model._
### Phase 2: Port elicit, distill, propagate skills
_Goal: all three sub-skills work via `/skill:elicit` etc._
- [ ] Create `.pi/skills/elicit/SKILL.md` — adapt frontmatter from `allium-main/skills/elicit/SKILL.md`
- [ ] Symlink references: `ln -s ../../../allium-main/skills/elicit/references .pi/skills/elicit/references`
- [ ] Smoke test `/skill:elicit` — start a mini elicitation session, verify it follows the methodology
- [ ] Create `.pi/skills/distill/SKILL.md` — adapt frontmatter from `allium-main/skills/distill/SKILL.md`
- [ ] Symlink references: `ln -s ../../../allium-main/skills/distill/references .pi/skills/distill/references`
- [ ] Smoke test `/skill:distill`
- [ ] Create `.pi/skills/propagate/SKILL.md` — adapt frontmatter from `allium-main/skills/propagate/SKILL.md`
- [ ] Smoke test `/skill:propagate`
- [x] Create `.pi/skills/elicit/SKILL.md` — adapt frontmatter from `allium-main/skills/elicit/SKILL.md`
- [x] Symlink references: individual symlinks in `.pi/skills/elicit/references/` (language-reference.md + library-spec-signals.md)
- [x] Smoke test `/skill:elicit` — start a mini elicitation session, verify it follows the methodology
- [x] Create `.pi/skills/distill/SKILL.md` — adapt frontmatter from `allium-main/skills/distill/SKILL.md`
- [x] Symlink references: individual symlinks in `.pi/skills/distill/references/` (language-reference.md + worked-examples.md)
- [x] Smoke test `/skill:distill`
- [x] Create `.pi/skills/propagate/SKILL.md` — adapt frontmatter from `allium-main/skills/propagate/SKILL.md`
- [x] Smoke test `/skill:propagate`
### Phase 3: Test with turn-limit extension (TDD)
_Goal: use distill → propagate on real code, verify allium produces useful output._

39
smoke/run-all.sh Executable file

@ -0,0 +1,39 @@
#!/usr/bin/env bash
set -uo pipefail
# Run all Phase 2 smoke tests
cd "$(dirname "$0")"
passed=0
failed=0
results=()
run_test() {
local name="$1"
local script="$2"
echo ""
echo "━━━ $name ━━━"
if bash "$script"; then
results+=("$name")
((passed++))
else
results+=("$name")
((failed++))
fi
}
run_test "structure" structure.sh
run_test "elicit" test-elicit.sh
run_test "distill" test-distill.sh
run_test "propagate" test-propagate.sh
echo ""
echo "━━━ summary ━━━"
for r in "${results[@]}"; do
echo "$r"
done
echo ""
echo "passed: $passed failed: $failed"
[ "$failed" -eq 0 ]

77
smoke/structure.sh Executable file

@ -0,0 +1,77 @@
#!/usr/bin/env bash
set -uo pipefail
# Structural smoke tests for Phase 2 skills
# Verifies files exist, frontmatter is valid, and reference symlinks resolve
cd "$(dirname "$0")/.."
pass=0
fail=0
check() {
local desc="$1"
shift
if "$@" >/dev/null 2>&1; then
printf ' \033[32m✓\033[0m %s\n' "$desc"
((pass++))
else
printf ' \033[31m✗\033[0m %s\n' "$desc"
((fail++))
fi
}
# Inverse check: passes when the command FAILS (e.g. grep finds nothing)
check_absent() {
local desc="$1"
shift
if "$@" >/dev/null 2>&1; then
printf ' \033[31m✗\033[0m %s\n' "$desc"
((fail++))
else
printf ' \033[32m✓\033[0m %s\n' "$desc"
((pass++))
fi
}
for skill in elicit distill propagate; do
echo "=== $skill ==="
check "SKILL.md exists" test -f ".pi/skills/$skill/SKILL.md"
check "frontmatter: name: $skill" grep -q "^name: $skill" ".pi/skills/$skill/SKILL.md"
check "frontmatter: disable-model-invocation" grep -q "^disable-model-invocation:" ".pi/skills/$skill/SKILL.md"
check "frontmatter: license" grep -q "^license:" ".pi/skills/$skill/SKILL.md"
check "frontmatter: metadata.upstream" grep -q "upstream:" ".pi/skills/$skill/SKILL.md"
check "references/ directory exists" test -d ".pi/skills/$skill/references"
done
echo "=== reference symlinks ==="
check "elicit: language-reference.md resolves" test -s ".pi/skills/elicit/references/language-reference.md"
check "elicit: library-spec-signals.md resolves" test -s ".pi/skills/elicit/references/library-spec-signals.md"
check "distill: language-reference.md resolves" test -s ".pi/skills/distill/references/language-reference.md"
check "distill: worked-examples.md resolves" test -s ".pi/skills/distill/references/worked-examples.md"
check "propagate: test-generation.md resolves" test -s ".pi/skills/propagate/references/test-generation.md"
echo "=== cross-references in SKILL.md ==="
check "elicit: refs use references/ prefix" grep -q '(references/language-reference.md)' ".pi/skills/elicit/SKILL.md"
check "elicit: sub-ref uses references/ prefix" grep -q '(references/library-spec-signals.md)' ".pi/skills/elicit/SKILL.md"
check "distill: refs use references/ prefix" grep -q '(references/language-reference.md)' ".pi/skills/distill/SKILL.md"
check "distill: sub-ref uses references/ prefix" grep -q '(references/worked-examples.md)' ".pi/skills/distill/SKILL.md"
check "propagate: refs use references/ prefix" grep -q 'references/test-generation.md' ".pi/skills/propagate/SKILL.md"
echo "=== invocation references ==="
check "elicit: uses /skill:tend" grep -q '/skill:tend' ".pi/skills/elicit/SKILL.md"
check "elicit: uses /skill:weed" grep -q '/skill:weed' ".pi/skills/elicit/SKILL.md"
check "distill: uses /skill:tend" grep -q '/skill:tend' ".pi/skills/distill/SKILL.md"
check "distill: uses /skill:weed" grep -q '/skill:weed' ".pi/skills/distill/SKILL.md"
check "propagate: uses /skill:distill" grep -q '/skill:distill' ".pi/skills/propagate/SKILL.md"
check "propagate: uses /skill:elicit" grep -q '/skill:elicit' ".pi/skills/propagate/SKILL.md"
echo "=== no upstream path leaks ==="
check_absent "elicit: no ../../references/" grep -q '\.\./\.\./references/' ".pi/skills/elicit/SKILL.md"
check_absent "distill: no ../../references/" grep -q '\.\./\.\./references/' ".pi/skills/distill/SKILL.md"
check_absent "elicit: no \`tend\` skill" grep -q '`tend` skill' ".pi/skills/elicit/SKILL.md"
check_absent "distill: no \`tend\` skill" grep -q '`tend` skill' ".pi/skills/distill/SKILL.md"
echo ""
echo "passed: $pass failed: $fail"
[ "$fail" -eq 0 ]

39
smoke/test-distill.sh Executable file

@ -0,0 +1,39 @@
#!/usr/bin/env bash
set -uo pipefail
# Smoke test: /skill:distill loads and the model responds with distillation methodology
# Expects: the response discusses extracting entities/rules from code
cd "$(dirname "$0")/.."
MODEL="Qwen3.6-35B-A3B-MXFP4_MOE.gguf"
TIMEOUT=180
echo "=== distill: invoking pi ==="
output=$(timeout "$TIMEOUT" pi -p --no-session --model "$MODEL" \
"/skill:distill I have a TypeScript function that checks if a user's subscription is active by comparing expiry date to now, and if expired sets status to 'expired'. How would you begin distilling this into an Allium spec? Keep it under 100 words." 2>&1)
rc=$?
if [ $rc -ne 0 ]; then
echo " FAIL: pi exited with code $rc"
echo "$output" | tail -20
exit 1
fi
if [ -z "$output" ]; then
echo " FAIL: empty response"
exit 1
fi
echo "$output"
echo ""
# Check the response is relevant to distillation
if echo "$output" | grep -iqE 'entit|rule|status|domain|abstract|spec|allium|subscription|temporal|transition'; then
echo " PASS: response contains distillation-relevant content"
exit 0
else
echo " FAIL: response does not appear to follow distillation methodology"
exit 1
fi

39
smoke/test-elicit.sh Executable file

@ -0,0 +1,39 @@
#!/usr/bin/env bash
set -uo pipefail
# Smoke test: /skill:elicit loads and the model responds with elicitation methodology
# Expects: the response references scoping, entities, or asks discovery questions
cd "$(dirname "$0")/.."
MODEL="Qwen3.6-35B-A3B-MXFP4_MOE.gguf"
TIMEOUT=180
echo "=== elicit: invoking pi ==="
output=$(timeout "$TIMEOUT" pi -p --no-session --model "$MODEL" \
"/skill:elicit I want to specify a simple counter that increments and resets. What are the first questions you would ask? Keep it under 100 words." 2>&1)
rc=$?
if [ $rc -ne 0 ]; then
echo " FAIL: pi exited with code $rc"
echo "$output" | tail -20
exit 1
fi
if [ -z "$output" ]; then
echo " FAIL: empty response"
exit 1
fi
echo "$output"
echo ""
# Check the response is relevant to elicitation
if echo "$output" | grep -iqE 'scope|boundar|entit|actor|specif|question|what.*system|who.*user'; then
echo " PASS: response contains elicitation-relevant content"
exit 0
else
echo " FAIL: response does not appear to follow elicitation methodology"
exit 1
fi

39
smoke/test-propagate.sh Executable file

@ -0,0 +1,39 @@
#!/usr/bin/env bash
set -uo pipefail
# Smoke test: /skill:propagate loads and the model responds with test generation methodology
# Expects: the response discusses test obligations, assertions, or test categories
cd "$(dirname "$0")/.."
MODEL="Qwen3.6-35B-A3B-MXFP4_MOE.gguf"
TIMEOUT=180
echo "=== propagate: invoking pi ==="
output=$(timeout "$TIMEOUT" pi -p --no-session --model "$MODEL" \
"/skill:propagate Given an Allium entity Order with status: pending | confirmed | shipped and a rule ConfirmOrder that transitions pending to confirmed, what test obligations would you derive? Keep it under 100 words." 2>&1)
rc=$?
if [ $rc -ne 0 ]; then
echo " FAIL: pi exited with code $rc"
echo "$output" | tail -20
exit 1
fi
if [ -z "$output" ]; then
echo " FAIL: empty response"
exit 1
fi
echo "$output"
echo ""
# Check the response is relevant to test generation
if echo "$output" | grep -iqE 'test|assert|obligation|transition|valid|invalid|state|verif|propert'; then
echo " PASS: response contains propagation-relevant content"
exit 0
else
echo " FAIL: response does not appear to follow propagation methodology"
exit 1
fi