Can you elaborate on this? What is brittle here?
paths and line endings of course, devil is in the detail of how you are producing the strings. But the core principle, in my view, is just that
the meaning of a syntactically valid program in a “type-correct” language should never depend upon the particular representation used to implement its primitive types.
…
this property of representation independence should hold for user-defined types as well as primitive types (Reynolds, Towards a Theory of Type Structure)
Ideally, I think we want to test that our programs mean what they are supposed to, and this should generally lead us to avoid attachment to particular representations. For me, this implies that, unless we are specifically trying to test the representations, when we want to test a value v : t we should try to test it in terms of some equal : t -> t -> bool and not via a to_string : t -> string that forces us to reduce all our meanings to the level of String.equal.
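To make that concrete, here is a minimal sketch (with a hypothetical date type, not anything from the discussion above): testing through the type's own equality depends only on the meaning of the values, while a string-based test must commit to one particular rendering.

```ocaml
(* A hypothetical [date] type with a structural equality. Testing via
   [equal] depends only on the meaning of the values, not on how we
   choose to render them. *)
type date = { year : int; month : int; day : int }

let equal (a : date) (b : date) =
  a.year = b.year && a.month = b.month && a.day = b.day

(* Two renderings of the same value: a string-based test would have to
   pick one of them, and would break whenever the printer changed. *)
let to_string_iso d = Printf.sprintf "%04d-%02d-%02d" d.year d.month d.day
let to_string_us d = Printf.sprintf "%02d/%02d/%04d" d.month d.day d.year

let () =
  let v = { year = 2024; month = 6; day = 1 } in
  let expected = { year = 2024; month = 6; day = 1 } in
  assert (equal v expected);                  (* holds regardless of printer *)
  assert (to_string_iso v <> to_string_us v)  (* the representations disagree *)
```

The point is only that the last assertion would make a string snapshot of this value representation-dependent, while the equality test is not.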
Can you expand what is brittle in practical terms? I am having a difficult time understanding what you mean specifically. Why are snapshot or expect tests brittle because of paths and line endings?
But the core principle, in my view
Does this override every other consideration, like eg the convenience of being able to snapshot the exact logic of the code, inspect it at a glance, and immediately find changes thanks to diffing?
Because paths and line endings are represented differently between POSIX and Windows. So if you have a test that converts data that contains, e.g., an Fpath.t to a string (say as part of producing JSON), then you will end up with different results depending on what platform you run it on. Depending on how you produce your snapshots, you are likely to have different line endings depending on whether you generate them on Windows or a POSIX system. The encoding used to generate the snapshot may also differ. But these are all just special cases of the complexity and brittleness that come with tying your tests to concrete representations.
my point is that, afaiu, you cannot do this by printing strings. A representation of some data in a string is not the exact logic of your code, unless the code is just a string.
Not for me! I am also not meaning to give anyone else imperatives, just sharing my view on the topic that was raised. But I am partial to working on things in ways that I think are sound and correct over ways that are error prone but merely convenient, when I can. I often make exceptions for different reasons, but in a discussion of principles I want to try to take the principled approach.
inspect it at a glance, and immediately find changes thanks to diffing
IMO, this is a good reason for having good quality reporting and diagnostics for failed tests, not for casting everything to strings and checking for string equality.
I have worked on code bases that use stringy fixtures and then produce errors as massive diffs. They are virtually impossible to get useful information out of, so strings provide no magic here.
On the other hand, just because you test for equality of the values you actually mean to test doesn’t mean you cannot report the difference via a diff between their string representations!
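A minimal sketch of that separation, with a hypothetical check helper (the names check and show are made up for illustration): the equality on the type is the source of truth, and the string representation is used only to report failures readably.

```ocaml
(* [equal] decides pass/fail; [show] is used only for the failure report. *)
let check ~equal ~show ~expected actual =
  if equal expected actual then Ok ()
  else
    Error
      (Printf.sprintf "expected: %s\n actual: %s" (show expected) (show actual))

let () =
  match check ~equal:Int.equal ~show:string_of_int ~expected:42 41 with
  | Ok () -> print_endline "ok"
  | Error msg -> print_endline msg
```

A fancier version could run a real diff over the two rendered strings, but the comparison itself never depends on them.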
Agreed, path representations can be different on Windows. But isn’t this rather easy to work around? Windows understands the forward-slash as a path separator character (for a long time now), I don’t see any reason to use backslash on Windows. And printing an absolute path as part of a test would be inherently non-portable because of the filesystem (drive) differences, so no one would do that anyway…
With line endings, I feel like it’s simple to always enforce Unix line endings; git can do it for you.
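For instance, a .gitattributes along these lines (a common recipe, not anything specific to the projects discussed here) normalizes line endings to LF in the repository:

```
# .gitattributes: treat files git detects as text, and check them out with LF
* text=auto eol=lf
```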
But sure, there is a little bit of extra complexity and brittleness that comes with snapshot testing. Eg, we have to be careful not to snapshot random values or current timestamps which will change on the next run. I feel like these are easy to find and fix though. Personally, I have been using snapshot testing for years and basically never run into these issues. You only have to fix up the format of the snapshot printer once; from then onwards it’s solid.
I just had my lunch and was reflecting on the exchange here, and I’m afraid I fell into a bit of a dogmatic position. I don’t think there are absolutes here, or one way that is best under all circumstances. Snapshots certainly have their uses.
E.g., if you have an essentially invariant bijection between the things you want to test and a sensible string representation, I think it would be silly to maintain (as I’m afraid I was) that testing against snapshots of the expected strings is subpar just because we are using a concrete representation.
I do think that it is worth some careful consideration about what one is trying to test, and to consider the tradeoffs that come with relying on concrete representations. These days, largely thanks to having been bitten a couple of times by fussy snapshot testing, I have a general bias towards maintaining and respecting abstraction boundaries where possible, and testing things in terms of the equality defined on their type. But I don’t want to claim a universal superiority of that approach.
But isn’t this rather easy to work around?
You can always add workarounds, yes. If the approach to testing is that you generate particular string representations of your data just to serve the needs of reliable tests, even when you are not actually trying to test the representations, I think one can make that work. But then of course you risk messing up your test logic via the intermediate translations.
That is nice to hear! I believe that if you prioritize snapshotting and build enough supporting conversions/masks/etc. for it, it can work well, even if that is not generally my preference these days.
That said, I do use cram style tests a lot, anytime I am working on CLIs, because, as I said, to my thinking that is a very appropriate and prime place for such tests.
For the cases where resorting to strings has its shortcomings, there are a few utils I have seen in this expect-test-helpers library allowing you to mix and match styles (they’re called require_*, and print a failure message if the condition does not hold).
One workflow this permits is to output CR comments when the properties are not satisfied (in case you accidentally promoted the output, or when you want to leave TODO hooks for yourself in dev mode).
For example (for_shon.ml):
let%expect_test "mixing snapshots with typed comparisons" =
let expected = "tricky string" in
let gotten = "ticky string " in
print_endline gotten;
[%expect {| ticky string |}];
require_equal [%here] (module String) gotten expected;
[%expect
{|
(* CR require-failed: notes/scratch/for_shon.ml:6:16.
Do not 'X' this CR; instead make the required property true,
which will make the CR disappear. For more information, see
[Expect_test_helpers_base.require]. *)
("values are not equal" "ticky string " "tricky string")
|}];
()
;;
I’m hoping soon to be able to propose similar workflows to the open source community with this project that parses crs comments. Allowing for example:
$ crs grep
File "scratch/for_shon.ml", lines 9-12, characters 4-239:
CR require-failed: notes/scratch/for_shon.ml:6:16.
Do not 'X' this CR; instead make the required property true,
which will make the CR disappear. For more information, see
[Expect_test_helpers_base.require].
Integrated in CI, review, editor, etc. these helpers are pretty convenient and ergonomic in my experience.
Isn’t that just a debug pretty-printer in OCaml land? Eg what you would get with the Fmt.Dump module. It’s useful not just for testing but also for exploration in the REPL and debugging of runtime values.
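For readers unfamiliar with the idea, here is a stdlib approximation of the kind of combinators Fmt.Dump provides (this is an illustration in plain OCaml, not the Fmt API itself): printers that render values in roughly OCaml syntax for inspection.

```ocaml
(* Dump-style combinators: render values in (roughly) OCaml syntax.
   [dump_list] takes an element printer and builds a list printer. *)
let dump_list dump_elt xs =
  "[" ^ String.concat "; " (List.map dump_elt xs) ^ "]"

(* %S quotes and escapes the string, so output reads like OCaml source. *)
let dump_string s = Printf.sprintf "%S" s

let () =
  print_endline (dump_list string_of_int [ 1; 2; 3 ]);
  (* prints [1; 2; 3] *)
  print_endline (dump_list dump_string [ "a"; "b" ])
  (* prints ["a"; "b"] *)
```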
you risk messing up your test logic
This is always a risk with testing whether one uses snapshots or not.
build enough supporting conversions/masks/etc. for it
Usually this is not that much! Most snapshot tests revolve around some central data structures. Eg in React (SPA) land you snapshot the HTML output of the JSX React nodes. Here you just have to be careful that you are not capturing too much of the output, but only the parts relevant for testing.
These seem quite useful. IIUC, this approach takes the view that you can start with expect-style testing, but then reinforce this with tests that cleave closer to your program semantics? Looking at, e.g., require_compare_equal, this basically gives what I’ve been advocating for: a test for equality is required as your source of truth, then you also have the expectation to give a diff on the string rep, which (in these cases) we’d expect to be equal for equal values. I’ve never looked at this library before, and it seems very useful. Thanks!
Looking forward to that!
I would not assume a debug pretty-printer is a stable bijection between strings and values that could be relied on for robust tests. I’d assume it was a pretty printer for debugging.
If you are testing x : t and y : t with equal : t -> t -> bool then you are not risking that your comparison logic has introduced irrelevant errors.
I think my last word on this topic is:
I am glad that you find snapshot testing useful and that it works well for your purposes! I also use it sometimes, but sparingly, and mostly only where I am concerned with testing string representations. If I am going to have to massage stuff, I’d much rather spend that time on PBT. But I think there are many ways to do things. Cheers and thanks for the conversation!
FWIW personally I don’t equate snapshot testing with comparing strings, you can compare arbitrary OCaml values. Of course you are still comparing some form of representation because you need to store a reference value, but then that doesn’t have to be a string, you can use richer, but literal representations.
In my own testing framework you can snapshot regular OCaml values (examples; without ppx, it should be stressed). And for diffing you can define your own diffing algorithm or simply use the default source-text-based one (which means that technically you can store some form of binary representation but still have nice diffs).
Property based testing is very cool, but then I often prefer to have explicit tests for edge cases (empty list, empty string, etc.) and bug reports. I find myself doing it more now that I have snapshot testing because I’m lazy. You start with an obviously simple but wrong answer (usually the neutral element of the type) and the snapshot corrects it to what your implementation does.
This sounds like a very nice approach! Snapshots without the stringy stuff and decent data diffing sounds great. Glad to learn about your testing library!
It’s not just encoding. Putting paths in your “expected” data may introduce a dependency on the build system. Different build systems put things in different places. I learned this the hard way when adding Bazel support for ppx_expect et al.