Boost Your CI: Parallel Integration Tests For Langstar
Hey there, fellow developers! Let's talk about something super important that impacts our daily workflow: speed. Specifically, how we can significantly boost our CI (Continuous Integration) pipeline by parallelizing integration tests for Langstar. If you've ever found yourself twiddling your thumbs, waiting for those tests to finish, you know exactly what I'm talking about. Our current setup, while ensuring stability, is definitely holding us back in terms of raw execution speed. This article dives deep into the "why" and "how" of making our integration tests run much, much faster, ultimately leading to quicker feedback loops and a more efficient development process for everyone on the Codekiln and Langstar teams.
The Big Bottleneck: Why Our CI Is Currently Slow (and How We're Fixing It)
Right now, our CI pipeline, specifically for Langstar's integration tests, is taking a leisurely stroll rather than a sprint. The core issue? These tests are running sequentially – one after another – which is like having a single-lane highway for hundreds of cars. We're talking about approximately 367 integration-related tests that just queue up, patiently waiting for their turn. This serial execution is enforced by a specific flag in our ci.yml workflow: --test-threads=1. This means that even if your machine (or the CI runner) has multiple CPU cores ready to churn through tasks, it's deliberately told to use only one thread for all integration tests. The result? A significant bottleneck that stretches our CI times, sometimes pushing us close to our 15-minute timeout limit. Imagine waiting that long just to know if your latest change broke something. It's not ideal, folks!
Why We Went Serial in the First Place
You might be wondering, "Why did we even set it up this way?" And that's a totally fair question. The decision to run tests with --test-threads=1 wasn't arbitrary; it was a pragmatic choice to avoid some pretty tricky issues inherent in concurrent testing. First off, we needed to prevent deployment name collisions. Many of our integration tests interact with external services, creating or modifying resources with specific names. If two tests tried to create a deployment with the exact same name at the same time, chaos would ensue! Secondly, there's the looming specter of API rate limiting. When you hit an external API too aggressively, it'll start to politely (or not-so-politely) tell you to slow down. Running tests sequentially helps us space out those API calls. Finally, there's the challenge of managing shared resources, especially things like OnceLock<TestDeployment>. This clever Rust primitive ensures that a piece of data is initialized only once, which is great for shared setup. However, it also means that if multiple tests try to manipulate that single shared deployment simultaneously, you're going to get unpredictable and flaky results, which is even worse than slow tests. So, while --test-threads=1 felt like a necessary evil to keep things stable, it’s now the primary target for optimization. Our goal here is to carefully unpick these dependencies and allow the vast majority of our tests to run in parallel without introducing new instabilities. We want the best of both worlds: speed and reliability.
The Game Plan: Smart Parallelization for Lightning-Fast CI
Our proposed solution isn't just about flipping a switch to "parallelize everything." Oh no, that would be a recipe for disaster! Instead, we're going for a smarter, more surgical approach. We'll be meticulously analyzing each and every integration test, categorizing them based on their resource usage and dependencies. The idea is to selectively enable parallel execution for those tests that can safely run concurrently without stepping on each other's toes. Simultaneously, we'll ensure that tests with shared state or external resource constraints continue to run in a serial fashion, or with specific isolation mechanisms. This hybrid approach allows us to reap the benefits of parallelism where possible, while maintaining the stability and reliability that's crucial for our CI. Think of it like optimizing traffic flow: instead of shutting down all lanes, we're opening up new lanes for compatible vehicles while keeping specific routes controlled. This targeted strategy is key to achieving significant speed improvements without introducing new headaches, making our development cycles much smoother and faster for everyone involved in Langstar and Codekiln.
Learning from the Best: What Reference Repositories Taught Us
Before we dive headfirst into implementation, we took a good look around to see how other well-respected projects handle their testing needs. It’s always smart to learn from others, right? We specifically looked at how LangSmith SDK manages its tests, especially those requiring true isolation. What we found was pretty insightful: LangSmith SDK uses cargo-nextest for tests that need process-level isolation. This isn't just about running tests in separate threads; it's about running them in entirely separate processes, which provides a much stronger guarantee against shared state issues. This is particularly relevant for PyO3 tests, which interact with Python and can have complex state interactions, much like our API integration tests with external state. Standard cargo test doesn't offer this level of isolation out of the box, making cargo-nextest a powerful tool for specific, high-isolation scenarios.
Our deep dive into various reference repositories, including popular CLI tools like ripgrep and bat, revealed a few key patterns. Most projects effectively use standard cargo test combined with its built-in features for everyday testing. For those tricky situations where serial execution is absolutely necessary, the serial_test crate emerged as a standard Rust practice. It's a neat, attribute-based solution that allows you to mark specific tests as serial, letting the rest run wild and free (in parallel!). Interestingly, we didn't see many reference repos adopting cargo-nextest workspace-wide; it was typically reserved for very specific isolation needs, like the LangSmith SDK's PyO3 requirements. This validation gives us confidence that a hybrid approach – using serial_test for fine-grained control and cargo test for default parallelism – is a solid and proven path forward for Langstar and Codekiln. The full analysis, by the way, is meticulously documented in our reference-repos-test-analysis.md for anyone who wants to dive into the nitty-gritty details.
Deconstructing Our Tests: What Can Go Parallel, What Can't?
Alright, let's get into the nitty-gritty of our existing test suite. Understanding which tests play nicely together and which ones demand their own spotlight is absolutely crucial for this parallelization effort. We've gone through our ~367 integration tests with a fine-tooth comb, and here's the breakdown of what's safe for parallelization (with a few tweaks), what must remain serial, and what's always safe. This categorization is the bedrock of our strategy, ensuring we get the speed boost without introducing flakiness or unpredictable behavior.
✅ Safe for Parallelization (with Modifications)
This is where we'll see the biggest gains, folks! A large chunk of our integration tests can absolutely run concurrently, provided we make some smart adjustments to how they handle resource naming.
1. SDK Integration Tests (~100 tests)
A significant portion of our SDK tests are prime candidates for parallel execution. The good news is that many of these already use a clever mechanism: unique, timestamp-based resource names. This is a brilliant start because it means they're already designed to avoid stomping on each other when creating resources.
- sdk/tests/graph_integration_test.rs: These tests leverage unique deployment names, which is fantastic for parallel runs.
- sdk/tests/dataset_test.rs: When creating datasets, they generate unique identifiers, again making them suitable for concurrent execution.
- sdk/tests/evaluations_test.rs: Similarly, these tests create unique evaluation runs, minimizing potential conflicts.
- sdk/tests/runs_query_test.rs: These are inherently safe because they're primarily read-only operations, querying existing data without modifying anything. Read operations are almost always safe for parallelization, as they don't change state.
- sdk/tests/structured_prompts_integration_test.rs: These tests create prompts with unique names, further reducing collision risks.
- sdk/tests/playground_settings_integration_test.rs: These involve CRUD (Create, Read, Update, Delete) operations, but because they often use unique names for their resources, they can largely operate in isolation.
Currently, we mitigate name collisions by using a generate_test_name function that appends a timestamp. While this has served us well, here's the catch: SystemTime::now().duration_since(UNIX_EPOCH).as_secs() only provides second-level precision. If two tests happen to kick off and call this function within the same second, they'll generate identical names, leading to collisions. This is a real risk when you're suddenly running many tests in parallel.
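For illustration, the current helper presumably looks something like this minimal sketch (the exact signature and location in our codebase may differ):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Sketch of the current, second-precision approach described above.
fn generate_test_name(prefix: &str) -> String {
    let timestamp = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("Time went backwards")
        .as_secs(); // whole seconds only -- two tests starting in the same second collide
    format!("{}-{}", prefix, timestamp)
}
```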
Solution: We need to enhance our unique naming strategy. We'll upgrade our generate_test_name to include microsecond precision and, for an extra layer of robustness, append a short UUID suffix. This combination virtually eliminates the chance of name collisions, even with highly concurrent test runs. This small but crucial change unlocks a huge chunk of our SDK tests for parallel execution, dramatically speeding up that part of our CI.
2. CLI Command Tests (~200 tests)
Many of our CLI tests are more akin to unit-style tests rather than full-blown API integration tests. This is excellent news because it means a significant portion of them don't even hit the external API.
- cli/tests/dataset_command_test.rs: Largely focuses on help output and input validation rather than live API calls.
- cli/tests/eval_command_test.rs: Similar to the dataset tests, these often validate command-line arguments and usage messages.
- cli/tests/runs_command_test.rs: Again, many of these exercise the CLI interface itself rather than a backend.
- cli/tests/prompt_scoping_test.rs: These are primarily read-only tests, checking existing prompt configurations.
- cli/tests/prompt_structured_test.rs: Another set of read-only tests, verifying prompt structures without modifying anything.
Safe for parallel: All these non-API-hitting tests can run in parallel immediately without any modifications. They operate purely on local state or mocked environments, making them inherently independent. This is a huge win for immediate speed gains!
⚠️ Must Remain Serial
Now for the tests that need a bit more care. These are the ones with intertwined state or significant external resource dependencies that simply cannot (or should not) run concurrently without proper isolation.
1. Assistant Command Tests
These tests are a classic example of why we initially went with --test-threads=1. They heavily rely on a single, shared deployment managed by a OnceLock<TestDeployment>.
// cli/tests/assistant_command_test.rs:10
static TEST_DEPLOYMENT: OnceLock<TestDeployment> = OnceLock::new();
Why serial: Imagine multiple tests trying to create, update, and then delete assistants all on the same deployment at the same time. It would be an absolute mess! Each test would interfere with the others, leading to race conditions, unexpected states, and flaky failures that are incredibly hard to debug. These tests are designed to operate on a consistent, known state of TEST_DEPLOYMENT.
Options:
- Keep --test-threads=1 for this specific test file only: This is the simplest and least intrusive option. We can isolate this file to always run serially while the rest of the suite goes parallel.
- Refactor to use per-test deployments: This is technically possible, but comes with a huge caveat: creating a new deployment can take anywhere from 1 to 3 minutes. If we did this for every assistant test, our CI would become even slower than it is now. This option is generally not recommended due to the significant time cost.
- Use the serial_test crate for fine-grained control: This is a very elegant solution. We can simply mark individual #[test] functions within this file as #[serial]. This allows other tests in different files to run in parallel, but ensures that all tests marked #[serial] run one after another, maintaining the integrity of our TEST_DEPLOYMENT. This is our preferred path for these tests.
2. Graph Lifecycle Tests
These tests in cli/tests/graph_command_test.rs are crucial for verifying the end-to-end lifecycle of deployments, from creation to deletion.
Why serial: The core problem here is that these tests verify deployment counts and list results. If you have one test creating a deployment while another is deleting one, and a third is asserting the total number of deployments, those assertions will be incredibly flaky and unreliable. The state of "how many deployments exist" is constantly in flux, making deterministic testing impossible. Furthermore, as noted with assistant tests, deployment creation itself is slow, taking 1-3 minutes. True parallelization for these actions isn't really possible in a meaningful way.
Options:
- Mark these specific tests as serial with the #[serial] attribute: Similar to the assistant tests, this is the most practical way to ensure their sequential execution without holding up the entire suite.
- Use persistent test deployments instead of creating per-test: This is an interesting long-term solution. Instead of creating and deleting deployments within each test run, we could have a set of pre-provisioned, persistent test deployments that tests interact with. This dramatically speeds up individual test runs by removing the 1-3 minute creation/deletion overhead. However, it introduces complexity around managing and cleaning up these persistent resources, and ensuring they are in a known good state before each test. It's a bigger architectural shift, but one worth considering in future optimizations. For now, #[serial] is the quick win.
📊 Read-Only Tests (Always Safe)
These are the unsung heroes of parallelization! Any test that simply queries or lists existing resources without making any modifications is inherently safe to run concurrently.
- Listing deployments
- Listing datasets
- Querying runs
- Getting organization info
These tests don't alter the global state, so they can run to their heart's content, completely independent of other tests. They require no special modifications and will benefit from parallel execution immediately once --test-threads=1 is removed. This category provides instant speed-up, offering quick feedback on our data retrieval mechanisms.
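To make the pattern concrete, here's a hedged sketch of what a parallel-safe, read-only test can look like. The client type and method names here are illustrative placeholders, not the real SDK API:

```rust
// Hypothetical read-only test: it only lists resources, so it can run in parallel.
// `Client`, `from_env`, `list_deployments`, and the `name` field are assumptions.
#[tokio::test]
async fn test_list_deployments_read_only() {
    let client = Client::from_env().expect("client configured for tests");
    let deployments = client.list_deployments().await.expect("list should succeed");
    // Deliberately no assertion on exact counts -- other tests may be creating or
    // deleting deployments concurrently, so we only check basic invariants.
    assert!(deployments.iter().all(|d| !d.name.is_empty()));
}
```

Note that it deliberately avoids asserting on exact counts, since other tests may be creating or deleting resources at the same time.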
How We'll Get There: Our Implementation Strategy
Okay, so we've identified the parallel-friendly and serial-required tests. Now, how do we actually implement this magical parallelization? We've got a couple of solid options, each with its own merits. Our primary goal is to maximize speed while maintaining rock-solid stability.
Option 1: Hybrid Approach with serial_test Crate (Recommended for now)
This is our go-to recommendation for an immediate, impactful, and relatively low-effort solution. The serial_test crate is a fantastic tool that allows for fine-grained control over test execution. You get to decide which tests run in parallel and which ones need to take their turn.
1. Add dependency:
First things first, we'll add serial_test to our Cargo.toml as a dev-dependency:
[dev-dependencies]
serial_test = "3"
This makes the crate available specifically for our testing environment.
2. Mark serial tests: With the crate in place, marking a test as serial is as simple as adding an attribute:
use serial_test::serial;
#[tokio::test]
#[serial] // This test will only run serially with other #[serial] tests
async fn test_assistant_create_basic() {
// This test uses the shared TEST_DEPLOYMENT, hence needs to be serial.
// Other non-#[serial] tests will run in parallel around it.
}
The #[serial] attribute acts as a lock. Any test marked with it will acquire an implicit global lock, ensuring that only one #[serial] test runs at any given time. However, tests without this attribute will continue to run in parallel with each other, even while a #[serial] test is executing. This is the beauty of it: maximum parallelism where safe, explicit serialization where needed.
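One refinement worth knowing about: serial_test also supports named keys, so unrelated serial groups don't block each other. A minimal sketch (assuming we pin a recent serial_test release that supports keys):

```rust
use serial_test::serial;

// Tests sharing the "assistant" key serialize only against each other;
// #[serial] tests with a different key (or no key) form separate groups.
#[tokio::test]
#[serial(assistant)]
async fn test_assistant_update() {
    // ... touches the shared TEST_DEPLOYMENT ...
}

#[tokio::test]
#[serial(assistant)]
async fn test_assistant_delete() {
    // ... also touches the shared TEST_DEPLOYMENT ...
}
```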
3. Update CI:
Finally, we update our GitHub Actions workflow to remove the --test-threads=1 flag. This will allow cargo test to use its default parallelism (which is typically based on the number of CPU cores).
# .github/workflows/ci.yml
# ...
- name: Run Integration Tests
run: cargo test --features integration-tests --workspace -- --nocapture
# The --test-threads=1 flag is GONE!
By removing --test-threads=1, cargo test will automatically utilize multiple threads for tests that aren't marked #[serial]. The #[serial] attribute ensures that those critical tests still run one by one.
Benefits of this approach:
- Most tests run in parallel: This is the big win! Expect significantly faster CI times.
- Serial tests remain safe: The #[serial] attribute guarantees that critical sections are protected.
- No major code refactoring required: This is a huge plus, as it minimizes the risk of introducing new bugs. We're just adding an attribute, not rewriting test logic.
- Simple to understand and implement: Rust developers are generally familiar with attributes, making this an easy pattern to adopt.
- High confidence: As validated by the reference repos, serial_test is a widely accepted and robust solution for selective serial execution in Rust.
Option 2: Cargo-Nextest with Profiles (Recommended if adopting nextest)
This option is a bit more involved as it suggests adopting cargo-nextest, a powerful next-generation test runner for Rust. It's a fantastic tool, especially if we're looking to enhance our testing infrastructure further, possibly tying into future work like issue #TBD (Test Runtime Tracking). cargo-nextest offers superior capabilities, including true process-level isolation, which is a step up from thread-level isolation provided by cargo test.
1. Create .config/nextest.toml:
cargo-nextest uses a central configuration file (.config/nextest.toml) to define profiles and override settings.
# .config/nextest.toml

[profile.default]
test-threads = "num-cpus" # Default to using all available CPU cores

[profile.integration]
test-threads = 4 # Conservative parallelism for integration tests

# A test group with max-threads = 1 effectively serializes its members.
# (Per-test overrides can't change test-threads directly; groups are the mechanism
# nextest provides for this -- double-check the syntax against the nextest docs
# for the version we pin.)
[test-groups.serial-integration]
max-threads = 1

[[profile.integration.overrides]]
filter = 'test(assistant)'        # Assistant command tests (substring match)
test-group = 'serial-integration' # Keep these effectively serial
This configuration sets a sensible default level of parallelism per profile and then uses overrides with nextest's filter expressions to route specific tests into a limited-concurrency group. For instance, test(assistant) matches any test whose name contains "assistant", and assigning those tests to the serial-integration group (with max-threads = 1) keeps them effectively serial while everything else runs in parallel. This provides incredible flexibility.
2. Update CI:
The CI workflow would then be updated to use cargo nextest instead of cargo test:
# .github/workflows/ci.yml
# ...
- name: Run Integration Tests with Nextest
run: cargo nextest run --profile integration --features integration-tests
We specify the integration profile, which will pick up our custom parallelism settings.
Benefits of this approach:
- Centralized configuration: All test parallelism rules are in one easy-to-manage file.
- Better test isolation: cargo-nextest runs each test in a separate process, which is a stronger isolation guarantee than separate threads and nearly eliminates shared-memory issues.
- Can partition tests across CI workers: With cargo-nextest, it's easier to split tests across multiple CI jobs, allowing for even greater parallelization across different machines.
- Validated by LangSmith SDK: This is a huge endorsement. LangSmith SDK uses cargo-nextest for its high-isolation needs, proving its capability for complex integration scenarios.
- Future-proof: If we ever need more advanced test management features (like retries, custom reporters, or shard distribution), cargo-nextest has us covered.
While cargo-nextest offers more power, it also introduces a new tool to our stack. If we're not ready to commit to it workspace-wide, Option 1 is a perfectly robust and simpler starting point. However, if the discussion around Issue #TBD leads to adopting cargo-nextest, then this approach becomes the top recommendation for its superior isolation and configurability.
Option 3: Split Integration Tests into Groups (Less Recommended)
This approach involves physically separating our parallel-safe tests from our serial-required tests into distinct CI jobs. While it provides clear separation, it adds complexity.
jobs:
integration-tests-parallel:
runs-on: ubuntu-latest
steps:
- name: Run Parallel Integration Tests
run: cargo test --features integration-tests --workspace --exclude langstar-assistant
# Runs with default parallelism (removes --test-threads=1)
integration-tests-serial:
runs-on: ubuntu-latest
steps:
- name: Run Serial Integration Tests
run: cargo test --features integration-tests --test assistant_command_test -- --test-threads=1
Here, we would have one job that runs all integration tests except those related to the assistant (which we know need to be serial). A separate job would only run the assistant tests, explicitly forcing them into serial execution.
Benefits:
- Clear separation of concerns: It's very obvious which tests are parallel and which are serial.
- Parallel tests finish quickly: The main bulk of tests aren't held up by the serial ones.
- Simple flags for each job: No new crates or complex configuration files.
Drawbacks:
- More complex CI configuration: Our .github/workflows/ci.yml would grow, potentially becoming harder to read and manage.
- Harder to maintain as tests evolve: If we add new serial tests, we have to remember to update both the exclude list in the parallel job and potentially add new serial jobs. This can become a maintenance burden.
- Increased CI overhead: Running two separate jobs incurs some overhead (checkout, setup, etc.) twice, potentially negating some of the speed gains compared to a single, smart test run.
For these reasons, we generally lean towards Option 1 or 2 as they offer a better balance of power, simplicity, and maintainability.
Keeping It Smooth: Risk Mitigation Strategies
Any time you change how tests run, especially involving parallelism, you introduce potential risks. But don't worry, folks, we've thought this through! We have several strategies to mitigate the common pitfalls of parallel testing, ensuring our CI remains stable and our test results reliable.
1. Prevent Name Collisions: Supercharging Unique Naming
The biggest headache with parallel integration tests hitting external APIs is resource name collisions. Our current timestamp-based naming is good, but as we saw, it's not foolproof for tests starting within the same second. We need to make it bulletproof.
Our enhanced strategy: We'll upgrade our generate_test_name function to include both microsecond precision and a short UUID suffix. This combination makes the chance of a collision astronomically small.
use uuid::Uuid;
use std::time::{SystemTime, UNIX_EPOCH};
fn generate_test_name(prefix: &str) -> String {
let timestamp = SystemTime::now()
.duration_since(UNIX_EPOCH)
.expect("Time went backwards") // Should never happen unless system time is messed up
.as_micros(); // Now we're getting microsecond precision!
let uuid_suffix = Uuid::new_v4().to_string()[..8].to_string(); // A short, unique suffix
format!("{}-{}-{}", prefix, timestamp, uuid_suffix) // Combining all for ultimate uniqueness
}
This new generate_test_name will create names like my-test-1678881234567890-a1b2c3d4. This level of uniqueness ensures that even if a dozen tests start at the exact same microsecond, their generated names will still be distinct, preventing resource conflicts in our external services. It's a small code change with a massive impact on parallel test reliability.
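As a quick sanity check, a test like the following (a hypothetical example; it assumes uuid is available as a dev-dependency alongside the helper above) demonstrates the property we care about, namely that even back-to-back calls produce distinct names:

```rust
#[test]
fn generated_names_are_unique() {
    // Even two calls within the same microsecond differ, thanks to the UUID suffix.
    let a = generate_test_name("my-test");
    let b = generate_test_name("my-test");
    assert_ne!(a, b);
}
```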
2. Handle API Rate Limits: Smart Retries
Even with unique naming, a sudden burst of parallel API calls can trigger rate limits from our external services. We don't want our CI to fail simply because we're being too fast!
Our solution: Implement exponential backoff with retries for API operations that are known to be rate-limited. This means if an API call fails due to a rate limit, we'll wait a little, then try again, doubling the wait time for subsequent failures.
use tokio::time::{sleep, Duration}; // For async sleeps

// Note: this assumes a custom error type (here called YourCustomError) with an
// is_rate_limit() helper, so we can tell rate-limit failures apart from real errors.
async fn create_with_retry<T, F, Fut>(
    operation: F, // The async operation that might fail
    max_attempts: u32,
) -> Result<T, YourCustomError>
where
    F: Fn() -> Fut,
    Fut: std::future::Future<Output = Result<T, YourCustomError>>,
{
    let mut attempts = 0;
    loop {
        match operation().await { // Await the operation
            Ok(result) => return Ok(result),
            Err(e) if e.is_rate_limit() && attempts < max_attempts => {
                attempts += 1;
                let delay = Duration::from_secs(2u64.pow(attempts)); // Exponential backoff: 2, 4, 8, 16... seconds
                println!("Rate limit hit, retrying in {} seconds (attempt {})", delay.as_secs(), attempts);
                sleep(delay).await; // Asynchronously wait before retrying
            }
            Err(e) => return Err(e), // Not a rate limit, or max attempts reached: propagate the error
        }
    }
}

// Example usage (the closure returns the future without awaiting it;
// create_with_retry awaits it internally):
// let my_resource = create_with_retry(|| api_client.create_resource(), 5).await?;
This create_with_retry helper function will gracefully handle transient rate limit errors, making our tests more robust against external service constraints. It's a common and highly effective pattern in distributed systems.
3. Use Persistent Test Resources (Long-Term Consideration)
While serial_test and cargo-nextest address isolation, the fundamental problem of slow resource creation (like deployments taking 1-3 minutes) remains. For tests that must interact with a deployment but don't need to test its creation/deletion lifecycle, we can significantly speed things up by reusing persistent resources.
The idea: Instead of creating a brand-new deployment for every single test run, we could have a single, long-lived test deployment that is created once, perhaps manually or by a separate setup script, and then reused across multiple CI runs.
/// Reuse persistent test deployment across test runs for certain tests.
const PERSISTENT_TEST_DEPLOYMENT_NAME: &str = "langstar-integration-test-persistent";
async fn get_or_create_test_deployment_once() -> Deployment {
// This function would check if PERSISTENT_TEST_DEPLOYMENT_NAME exists.
// If it does, fetch it.
// If not, create it.
// Importantly: it would *never* delete it.
// Cleanup would be handled by a separate, perhaps nightly, job.
}
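Fleshed out slightly, the body could look something like the sketch below. This is only illustrative: Client, get_deployment, and create_deployment are hypothetical stand-ins for whatever our real SDK exposes, and it uses tokio's async OnceCell so the lookup happens at most once per test binary:

```rust
use tokio::sync::OnceCell;

static PERSISTENT_DEPLOYMENT: OnceCell<Deployment> = OnceCell::const_new();

// Hypothetical sketch: `Client`, `Deployment`, and the method names are placeholders.
async fn get_or_create_test_deployment_once(client: &Client) -> &'static Deployment {
    PERSISTENT_DEPLOYMENT
        .get_or_init(|| async {
            // Fetch the well-known deployment if it exists; create it only if missing.
            // Note: we never delete it here; cleanup is a separate (e.g. nightly) job.
            match client.get_deployment(PERSISTENT_TEST_DEPLOYMENT_NAME).await {
                Ok(existing) => existing,
                Err(_) => client
                    .create_deployment(PERSISTENT_TEST_DEPLOYMENT_NAME)
                    .await
                    .expect("failed to create persistent test deployment"),
            }
        })
        .await
}
```

The existing signature above takes no arguments; whether we pass a client in (as here) or construct one inside is a design detail we'd settle during implementation.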
Benefits of persistent resources:
- Significantly faster test execution: Eliminates the 1-3 minute wait for deployment creation on every test run.
- Fewer API calls: Reduces the load on external APIs, further mitigating rate limit risks.
- Reduced resource churn: Less creation and deletion means less chance of hitting transient service issues related to resource provisioning.
Considerations: This approach requires a robust cleanup strategy (e.g., a nightly cron job to delete old persistent deployments) to avoid resource accumulation and cost. It's a bigger architectural change but offers substantial long-term speed benefits for specific types of tests. This would be a Phase 5 optimization, beyond the initial parallelization.
What to Expect: Performance Gains
Let's talk numbers, guys! What kind of speed boost are we actually looking at?
Current State:
- Execution: All integration tests run serially.
- Test Count: ~367 integration tests.
- Estimated Time: Based on our CI timeout of 15 minutes, the full integration test suite currently takes anywhere from 5 to 10 minutes to complete. This is a significant chunk of our total CI time.
After Parallelization (Our Goal):
Once we implement our chosen strategy (likely serial_test first, potentially cargo-nextest later), we anticipate a dramatic reduction in test execution time.
- Unit Tests: Good news here! Our unit tests are already running in parallel, so they are generally quite fast and won't see much change from this initiative.
- SDK Integration Tests: These are our big win! With around 100 tests that can largely be parallelized (thanks to improved unique naming and removing the --test-threads=1 constraint), we expect a 50-70% reduction in their execution time. Imagine these flying through in a fraction of the current time!
- CLI Integration Tests: Another huge chunk, with roughly 200 tests. Many of these are non-API-hitting and can run in parallel immediately. For the API-hitting ones, unique naming and smart retries will allow them to scale. We're looking at an estimated 60-80% time reduction for this category.
- Assistant Tests & Graph Lifecycle Tests: A smaller subset, but crucial. These will still run serially (or with very limited parallelism for specific parts) due to their shared resource dependencies and slow deployment operations. However, because they are a smaller number of tests, their serial execution won't hold up the entire pipeline as much as it does currently. Their individual run times won't change dramatically, but the overall impact on the total suite will be minimized.
Expected Total Integration Test Time: We're aiming to bring down the total integration test execution time from the current 5-10 minutes to a lean, mean 2-5 minutes. This would be a massive win, providing much faster feedback to developers, reducing CI queue times, and making our entire development process more agile. This means less waiting for PRs to merge, and more time actually building awesome features for Langstar and Codekiln!
The Roadmap: Implementation Phases
To make sure we tackle this systematically and safely, we've broken down the implementation into clear phases. We've already knocked out a big one, which is awesome!
Phase 1: Identify and Categorize (1-2 hours) ✅ COMPLETED
- [x] Analyze all integration test files: We've meticulously reviewed sdk/tests/ and cli/tests/ to understand their dependencies.
- [x] Document shared resources and dependencies: We know exactly which OnceLock instances and deployment-specific interactions exist.
- [x] Identify tests that MUST be serial: The Assistant and Graph Lifecycle tests are clearly marked.
- [x] Identify tests that CAN be parallel: The majority of SDK and CLI tests fall into this category.
- [x] Validate approach against reference repositories: We've learned from LangSmith SDK, ripgrep, and bat, confirming our strategies are sound.
This phase was critical for laying the groundwork and building confidence in our approach. Great job, team!
Phase 2: Improve Test Isolation (2-4 hours)
This phase focuses on making our parallel-safe tests truly robust against collisions and external issues.
- [ ] Add microsecond precision to generate_test_name(): This is our first line of defense against name collisions. We'll update the helper function as discussed.
- [ ] Consider a UUID suffix for additional uniqueness: For that extra layer of bulletproofing, we'll integrate a short UUID suffix.
- [ ] Implement retry logic for rate-limited operations: We'll introduce the create_with_retry pattern for API calls known to hit rate limits, making them more resilient.
- [ ] Document test resource requirements: Update our internal docs to clearly state best practices for naming and resource handling in new integration tests.
Phase 3: Enable Selective Parallelization (2-4 hours)
This is where we actually implement the core parallelization strategy.
- [ ] Choose implementation strategy: We're leaning heavily towards Option 1 (the serial_test crate) as our initial high-confidence step, given its simplicity and effectiveness. If a decision is made to adopt cargo-nextest more broadly, we'd pivot to Option 2.
- [ ] Add the serial_test dependency OR configure nextest profiles: Based on the chosen option, we'll either add the dev-dependency or create the .config/nextest.toml file.
- [ ] Mark serial tests with the #[serial] attribute: Go through the identified serial tests (Assistant, Graph Lifecycle) and add the #[serial] attribute to their test functions.
- [ ] Update CI workflow to allow parallel execution: Remove the infamous --test-threads=1 flag from our .github/workflows/ci.yml.
- [ ] Test in CI to verify no flakiness: Run several CI cycles and closely monitor for any new, intermittent failures. This is a crucial validation step.
Phase 4: Monitor and Optimize (Ongoing)
Parallelization isn't a "set it and forget it" kind of deal. Continuous monitoring is key.
- [ ] Track test execution times: This ties into a related future issue about test runtime tracking. We need metrics to confirm our improvements and identify new bottlenecks.
- [ ] Identify remaining slow tests: Even with parallelization, some tests might still be disproportionately slow. We'll find them and investigate further optimizations.
- [ ] Gradually increase parallelism as confidence grows: If we start conservatively, we can slowly experiment with more aggressive parallelism (e.g., higher test-threads values if using nextest) once stability is confirmed.
- [ ] Document lessons learned: Share our findings and best practices with the team.
Total Estimated Effort: We're looking at roughly 1-2 days for the initial implementation (Phases 2 and 3), followed by ongoing monitoring. This is a small investment for a significant return in development velocity!
Feeling Confident: Our Assurance Levels
We're not just guessing here, folks! Our confidence in this approach is built on solid research and established Rust best practices.
| Approach | Confidence | Evidence |
|---|---|---|
| serial_test crate | ✅ High | This is a widely adopted, standard Rust practice for selectively marking serial tests. Its usage is straightforward and well-documented. |
| cargo-nextest (if adopting) | ✅ High | LangSmith SDK uses cargo-nextest for its process-level isolation requirements, which is a strong validation. It's a robust, production-ready tool. |
| Improved unique naming | ✅ High | We're already using timestamp-based naming, and adding microsecond precision + UUID suffixes is a direct enhancement of an existing, proven pattern. It's a low-risk, high-reward change. |
| Parallel by default | ✅ High | Simply removing --test-threads=1 allows cargo test to use its default parallelism, which is how cargo test is designed to work for independent tests. Our serial_test marks then handle the exceptions. |
This blend of proven tools and sensible enhancements gives us very high confidence that we can achieve our goals without introducing instability.
Success! Our Acceptance Criteria
How will we know we've nailed this? Here are the key indicators of success:
- [ ] Integration tests categorized as parallel-safe or serial-required: This foundational work is already largely done, as covered in Phase 1.
- [ ] Improved unique naming prevents resource collisions: Verified by running parallel tests without unexpected failures due to name conflicts.
- [ ] serial_test crate integrated (or nextest profiles configured): The chosen tool is successfully implemented and configured.
- [ ] Serial tests explicitly marked with #[serial] or equivalent: All tests identified in Phase 1 as needing serial execution have the appropriate markers.
- [ ] CI updated to allow parallel execution by default: The --test-threads=1 flag is gone!
- [ ] CI test execution time reduced by >50%: This is our primary metric. We'll be tracking it closely to ensure we hit our performance targets (aiming for 2-5 minutes total for integration tests).
- [ ] No test flakiness introduced: Critical! We can't trade speed for unreliable results. Any new intermittent failures will be addressed immediately.
- [ ] Documentation updated with parallelization strategy: New developers need to understand how our tests work and how to write new ones that respect the parallelization rules.
- [ ] Test results still reliable and deterministic: At the end of the day, our tests must accurately reflect the state of our codebase.
The Journey Continues: Related Issues & Resources
This effort is part of a broader push to make our development workflow as smooth and efficient as possible.
- #508: Integration tests running slowly: This is the immediate trigger for this initiative, and we expect this effort to resolve it.
- Future: Issue for test runtime tracking: A companion issue that will focus on implementing robust metrics and dashboards to monitor test performance over time. This will be crucial for ongoing optimization.
References to dive deeper:
- serial_test crate documentation: https://docs.rs/serial_test/latest/serial_test/
- cargo-nextest documentation: https://nexte.st
- Rust Testing Best Practices: https://doc.rust-lang.org/book/ch11-00-testing.html (always a good read!)
- Current CI Configuration: .github/workflows/ci.yml:66-90 (where all the magic happens!)
- Test README: cli/tests/README.md (for CLI-specific testing notes)
A Few Final Thoughts
Just a quick reminder, guys: while we're super excited about the speed gains, it's always smart to start conservative. When we first enable parallelization, we might mark a few more tests as serial than strictly necessary, just to be safe. Then, as our confidence grows and we see stable results, we can gradually enable more and more parallelism. Always monitor for flakiness – that's our canary in the coal mine! And remember, tools like cargo nextest offer process-level isolation which is generally superior to thread-level, so keep that in mind for future enhancements. Finally, even with parallel execution, API rate limits might still be a factor, so our retry logic will be a constant guard. The goal is faster, more reliable CI, and with this plan, we're well on our way to achieving it for Langstar and Codekiln!