In 2003, I was working on lipid bilayers and silicon microelectrode substrates in Japan, using instrumentation that forced you to think carefully about what could be observed and what could only be inferred. Two decades later, I found myself asking similar questions about large language models and AI-assisted workflows.
The scale changed. The vocabulary changed. The underlying intellectual problem did not: how do you distinguish signal from noise inside a system that does not reveal itself cleanly?
Measurement travels well across domains
My career moved through biosensors, synchrotron radiation work, neuroscience instrumentation, applied R&D, cloud platforms, and AI systems. On paper those look like separate worlds. In practice, each one punished lazy measurement and rewarded careful framing.
That continuity matters because many AI discussions today are still too software-centric. They focus on features and benchmarks before asking what exactly is being measured and which failures should count as unacceptable.
AI validation is a measurement discipline
When an LLM behaves badly in a workflow, the issue is rarely just “the model made a mistake.” The better questions are under what conditions the mistake appears, how often it appears, whether the surrounding system can detect it, and what operational consequence it creates.
Those are measurement questions. Anyone who has spent years around sensors, experimental systems, or neuroscience tools recognizes the pattern immediately.
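To make that concrete, here is a minimal sketch of what answering the first two questions can look like in practice: tag each evaluation case with the condition it ran under, then compute conditional failure rates instead of one global number. The record fields and the example conditions are assumptions for illustration, not a prescribed schema.

```python
# A minimal sketch of condition-conditioned failure analysis.
# The record fields ("condition", "passed") and the example
# condition names are illustrative assumptions, not a standard.
from collections import defaultdict

def failure_rates_by_condition(records):
    """Group eval outcomes by the condition they ran under,
    then report the failure rate within each condition."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for rec in records:
        totals[rec["condition"]] += 1
        if not rec["passed"]:
            failures[rec["condition"]] += 1
    return {cond: failures[cond] / totals[cond] for cond in totals}

# Hypothetical results: failures cluster under one condition
# rather than spreading uniformly across the eval set.
records = [
    {"condition": "short_input", "passed": True},
    {"condition": "short_input", "passed": True},
    {"condition": "long_context", "passed": False},
    {"condition": "long_context", "passed": True},
    {"condition": "long_context", "passed": False},
]
for cond, rate in failure_rates_by_condition(records).items():
    print(f"{cond}: {rate:.0%} failure rate")
```

The point of the shape, not the specifics: once failures are indexed by condition, “the model made a mistake” becomes “the model fails two-thirds of the time under long-context inputs,” which is something a team can act on.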
Failure modes are often more valuable than averages
Another lesson that carried over from research is that averages can hide the most interesting behavior. A system can look acceptable on aggregate while failing badly in the cases that matter most. Healthcare systems teach this brutally well.
That is why evaluation needs both summary signals and edge-case discipline. Teams that only optimize the headline number often discover too late that the system is fragile where users most need confidence.
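A toy illustration of how a headline number hides exactly this fragility. The slice names and pass/fail counts below are invented; the shape of the failure is not.

```python
# A minimal sketch contrasting a headline metric with per-slice
# results. Slice names and outcome counts are hypothetical.

def accuracy(outcomes):
    return sum(outcomes) / len(outcomes)

# 95 routine cases that mostly pass, 5 critical edge cases that
# mostly fail: the kind of split averages are blind to.
slices = {
    "routine":       [True] * 93 + [False] * 2,
    "critical_edge": [True] * 1 + [False] * 4,
}

overall = accuracy([o for outcomes in slices.values() for o in outcomes])
print(f"headline accuracy: {overall:.0%}")      # 94% -- looks acceptable
for name, outcomes in slices.items():
    print(f"{name}: {accuracy(outcomes):.0%}")  # 98% vs 20% -- the real story
```

Ninety-four percent overall, twenty percent on the cases that matter most. Nothing in the headline number warns you.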
Measurement is not what you do at the end to see whether a system worked. Measurement is the system by which you decide what “worked” is allowed to mean.
Why leaders should build this instinct
Engineering leaders do not all need research careers. They do need a better measurement instinct as AI becomes operationally important. Without that, teams confuse surface fluency with reliability and speed with evidence.
The advantage of a research background is not prestige. It is the habit of asking whether the thing you are observing is really the thing you think you built.