how to design tests, iterate, and ensure robustness

from running dozens of experiments in the signal-processing-workflow (on brains, on hardware, on signals), i developed a sense for what separates experiments that produce useful data from ones that produce noise.

the experiment design framework

define what you're measuring before you start

the most common failure: "not insanely intentional with testing. could do much better with understanding what is going on and for what reason."

concretely, before any test:

  • what signal am i looking for?
  • what would success look like in the data?
  • what would failure look like?
  • what are the confounds?
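the four questions above can be forced in code. a minimal sketch, assuming a hypothetical plan record that the run script checks before starting — the fields and example answers are illustrative, not from any real script:

```python
# sketch: make the "define before you run" step mandatory. a hypothetical
# TestPlan record; run scripts would refuse to start until every field
# has an answer. fields mirror the four pre-test questions.

from dataclasses import dataclass

@dataclass
class TestPlan:
    signal: str     # what signal am i looking for?
    success: str    # what would success look like in the data?
    failure: str    # what would failure look like?
    confounds: str  # what are the confounds?

    def ready(self) -> bool:
        # every question needs a non-empty answer before the run starts
        return all([self.signal, self.success, self.failure, self.confounds])

plan = TestPlan(
    signal="P100 deflection over occipital electrodes",
    success="repeatable peak ~100 ms after each reversal",
    failure="flat average, or a peak that moves between runs",
    confounds="blinks, line noise, attention drift",
)
# plan.ready() -> True; an empty field would make it False
```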

for our visual evoked potential tests, the checklist was:

  • one eye (monocular)
  • fixate on central target
  • dark room
  • 70-100cm from screen
  • consider contrast
  • which test type (pattern reversal at 2Hz, onset/offset, flash)
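the pattern-reversal timing from the checklist can be sketched as a schedule generator. this assumes "2Hz" means two reversals per second (one flip every 500 ms) — the rate convention varies between papers, so treat that as an assumption:

```python
# sketch: reversal schedule for a 2 Hz pattern-reversal VEP run.
# assumes the rate counts reversals per second, not full cycles.

def reversal_times(rate_hz: float, duration_s: float) -> list[float]:
    """Return the times (in seconds) at which the checkerboard flips."""
    interval = 1.0 / rate_hz          # 2 Hz -> flip every 0.5 s
    n = int(duration_s / interval)
    return [i * interval for i in range(1, n + 1)]

times = reversal_times(rate_hz=2.0, duration_s=3.0)
# times -> [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
```

a stimulus script would flip the checkerboard contrast at each of these times and log the actual flip timestamps for epoching later.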

control one variable at a time

we made the mistake of changing multiple things between tests — new stimulus pattern AND new filtering AND new electrode placement. when results changed, we couldn't tell why.

the better approach: "do something to get a VEP on myself first." test the simplest possible case. if that doesn't work, the problem is fundamental. if it does work, add complexity one variable at a time.

the validation ladder

from simplest to most complex:

  1. can you see alpha waves with eyes closed? (if no, hardware is broken)
  2. can you see a response to a flash on yourself? (if no, timing or processing is broken)
  3. can you see a response on another person? (if no, setup or parameters might be off)
  4. can you reproduce results on a second trial? (if no, noise is too high)
  5. can you see differences between conditions? (this is the actual experiment)

skipping to step 5 without passing steps 1-4 was a mistake i made repeatedly.
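step 1 of the ladder is cheap to check numerically: eyes-closed recordings should show more 8-12 Hz (alpha) power than eyes-open ones. a minimal sketch on synthetic data, assuming a 250 Hz sample rate; a real check would feed in two short recordings from the headset instead:

```python
import numpy as np

# sketch: ladder step 1 -- compare alpha-band (8-12 Hz) power between
# conditions. the data below is synthetic: broadband noise for eyes-open,
# plus an added 10 Hz rhythm for eyes-closed. fs is an assumption.

def band_power(x: np.ndarray, fs: float, lo: float, hi: float) -> float:
    """Power in the [lo, hi] Hz band via a plain FFT periodogram."""
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2
    mask = (freqs >= lo) & (freqs <= hi)
    return float(psd[mask].sum())

fs = 250.0
t = np.arange(0, 4.0, 1.0 / fs)
rng = np.random.default_rng(0)
eyes_open = rng.normal(0, 1, t.size)                       # broadband noise only
eyes_closed = eyes_open + 3 * np.sin(2 * np.pi * 10 * t)   # add a 10 Hz rhythm

ratio = band_power(eyes_closed, fs, 8, 12) / band_power(eyes_open, fs, 8, 12)
# ratio >> 1 means the alpha check passes; near 1 means hardware trouble
```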

iteration patterns

the research → test → analyze loop

"lots of iteration. lots of failure." the actual workflow:

  1. research: read standards, look at what parameters others use
  2. set up: configure hardware, write stimulus script, prepare environment
  3. run: execute the test, usually 15-45 minutes
  4. analyze: process data, plot, look at results
  5. interpret: is this real signal or noise? compare to expected results.
  6. adjust: change one parameter and go back to step 3

steps 3-6 repeat dozens of times. "tested checkerboard on myself, tried flash on myself, tried checkerboard on another person, old stuff was all noise."
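the "is this real signal or noise?" question in step 5 is mostly answered by epoch averaging: averaging n time-locked epochs shrinks uncorrelated noise by roughly sqrt(n), while a genuine evoked response survives. a synthetic illustration (all numbers made up):

```python
import numpy as np

# sketch: averaging 64 time-locked epochs cuts uncorrelated noise by
# roughly sqrt(64) = 8x, so a real evoked response emerges from single
# trials that look like pure noise. the "evoked" waveform is synthetic.

rng = np.random.default_rng(1)
n_epochs, n_samples = 64, 200
evoked = np.sin(np.linspace(0, np.pi, n_samples))            # pretend VEP shape
epochs = evoked + rng.normal(0, 2.0, (n_epochs, n_samples))  # noisy single trials

avg = epochs.mean(axis=0)
single_err = np.abs(epochs[0] - evoked).mean()  # error of one trial
avg_err = np.abs(avg - evoked).mean()           # error after averaging
# avg_err comes out near single_err / 8, as the sqrt(n) rule predicts
```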

when to pivot vs persist

"just check differences and try to maximize it" vs "maybe they are just good, we are reasonably satisfied." the tension between perfectionism and pragmatism.

heuristic: if you've tried 5 different parameter combinations and none work, the problem is probably not the parameters. step back and question the approach.

"tried many training things, didn't do much" — knowing when to pivot to a completely different method rather than tweaking the current one.

the timing problem

timestamp synchronization was a recurring nightmare. each device (EEG headset, stimulus presentation, sensors) has its own clock. getting them aligned:

  • tried using a wire to send sync pulses — "doesn't work because it sends a pulse that is picked up by the headset in a huge spike"
  • tried software timestamps — "current script might not be using some settings that are important"
  • tried logging approach — eventually worked but required careful validation

"need to get correct times" — without precise timing, epoch-averaging is meaningless. this is an unsexy but critical part of experiment-design that textbooks don't emphasize enough.

designing for robustness

the subject experience

"if want longer tests, distraction — brain not focusing." for human experiments, the subject's experience matters:

  • boredom causes attention drift
  • watching videos might confound the signal (occipital lobe activation, pupil changes from brightness)
  • need to balance test duration with data quality

"how to have them not be bored? video? inscapes might not be bad — could confound occipital lobe, could affect pupils due to brightness."

the engineering gamble

sometimes you have to make a call with incomplete information. "data analysis stuff: hard because big data, for robustness can't really use LLMs that much."

the meta-skill: knowing when you have enough data to make a decision vs when you're just pattern-matching on noise. "interpreting graphs — sometimes good but looks bad, sometimes bad but looks good. need to balance time scrutinizing graph and writing script."


see also: signal-processing-workflow, debugging-hardware, reading-papers
