Artificial intelligence (AI) is rapidly entering clinical settings in a fragmented and largely unregulated manner. As of May 2025, at least 377 healthcare systems and providers in the U.S. have piloted or adopted 70 generative AI tools developed by 49 companies for clinical decision support, patient communication, documentation, claims processing, and healthcare administration. The majority of American physicians report using AI technologies in clinical care. Similar trends are observed globally: 48% of clinicians across 109 countries report AI use in their work, and large-scale deployments are emerging across health systems in China.

This adoption reflects the strong performance of modern AI models on largely synthetic benchmarks. Large language models (LLMs) match or exceed physician performance in diagnostic reasoning, clinical text summarization, medical question answering, patient communication, and multi-step reasoning tasks. However, concerns about hallucinations, bias, and reliability remain. Real-world clinical performance remains poorly understood. More realistic benchmarks are emerging, but systematic evaluation in deployed settings remains rare.

A central limitation is the lack of logging for each instance in which a medical AI model is used. Without consistent records of how models are used in practice, health systems cannot reliably assess performance, detect failure modes, or measure clinical impact. Reporting frameworks, including TRIPOD+AI, STARD-AI, DECIDE-AI, SPIRIT-AI, and CONSORT-AI, focus on model development and evaluation in controlled settings. They do not address continuous, event-level monitoring of deployed systems, especially generative AI and agentic workflows that rely on prompting, retrieval, and tool use. Several governance and evaluation frameworks have also been proposed. However, no broadly adopted standard records each instance of AI use in clinical care.
This gap limits the ability to evaluate medical AI in practice and to govern its use. The FDA has emphasized that clinicians cannot feasibly oversee all outputs of generative AI systems. Medical AI therefore requires systematic monitoring. In other domains, centralized logging protocols such as syslog fill this role by recording each system event in a consistent way across many connected services. These logs support real-time monitoring, root-cause analysis, and auditing at scale. Medical AI has no equivalent standard.

Here we introduce MedLog, a protocol for event-level logging of medical AI. MedLog specifies a schema for each model invocation, including inputs, outputs, intermediate artifacts, and, when available, clinical outcomes and user feedback. It records both single-step interactions and multi-stage agentic workflows by linking events over time. MedLog covers any AI process that uses health data and can affect patient outcomes, including AI-human and AI-AI interactions. This includes interactions between models and patients, clinicians, administrators, and other stakeholders; background services such as batch inference, autonomous triage, claim routing, and continuous monitoring; and AI-AI exchanges within agentic workflows and orchestration frameworks.

We evaluate MedLog across four clinical deployments spanning intensive care monitoring, infectious disease severity prediction, hospital quality reporting, and patient attendance prediction. MedLog makes model behavior visible in practice. It reveals patterns that offline evaluation does not capture, including temporal failure modes, workflow-dependent variability, interactions between model outputs and clinician behavior, and performance degradation during severe weather events. By standardizing how health systems record AI use and link it to outcomes, MedLog creates a basis for continuous evaluation, auditing, and improvement of medical AI. As AI becomes embedded in clinical workflows, this infrastructure will be needed to measure real-world performance, detect failures, and guide deployment at scale.
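
To make the idea of event-level logging concrete, the following is a minimal Python sketch of one record per model invocation, with events in a multi-stage workflow linked through a parent reference. Everything named here is an assumption made for illustration: the `EventRecord` class, its field names, the `workflow_chain` helper, and the example model identifiers are hypothetical and are not the MedLog schema or API. The actual record fields are described on the record schema page.

```python
from __future__ import annotations

import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class EventRecord:
    """Hypothetical event record; field names are illustrative, not MedLog's."""

    model_id: str
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    artifacts: dict[str, Any] = field(default_factory=dict)  # intermediate artifacts
    outcome: Optional[dict[str, Any]] = None    # clinical outcome, when available
    feedback: Optional[dict[str, Any]] = None   # user feedback, when available
    parent_event_id: Optional[str] = None       # links a step to the previous step
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def workflow_chain(events: list[EventRecord], last_event_id: str) -> list[EventRecord]:
    """Follow parent links back from the last event; return steps oldest first."""
    by_id = {e.event_id: e for e in events}
    chain: list[EventRecord] = []
    current = by_id.get(last_event_id)
    while current is not None:
        chain.append(current)
        parent = current.parent_event_id
        current = by_id.get(parent) if parent else None
    return list(reversed(chain))


# A two-step agentic workflow: a retrieval step followed by a drafting step.
retrieval = EventRecord(
    model_id="example-retriever",  # hypothetical model name
    inputs={"query": "recent potassium results"},
    outputs={"document_ids": ["doc-1", "doc-2"]},
)
draft = EventRecord(
    model_id="example-llm",  # hypothetical model name
    inputs={"prompt": "Summarize the retrieved results."},
    outputs={"text": "Potassium within normal range."},
    artifacts={"retrieved": ["doc-1", "doc-2"]},
    parent_event_id=retrieval.event_id,
)

chain = workflow_chain([draft, retrieval], draft.event_id)
line = json.dumps(asdict(draft), sort_keys=True)  # one JSON line per event
```

The records are frozen so that an event cannot be edited after it is created, which loosely mirrors a write-once log. In a design like this, an outcome or feedback that arrives later would most naturally be recorded as a new event that points back to the original one, though that choice is also an assumption of the sketch.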

Quickstart

Emit your first MedLog record in a few API calls.

The record schema

The nine fields of every MedLog record.

Deployments

Four clinical deployments monitored with MedLog.

API reference

Write-once event endpoints with an interactive playground.