Built an intraday ML system, found my backtest was 100% in-sample. Out-of-sample it’s pure noise. Where do I go from here?


Alex Rivera, CFA Lead Analyst · 12 Years Testing


When Your Backtest Lies: What to Do After Your Intraday ML System Returns Pure Noise Out-of-Sample

Not financial advice. Past performance is not indicative of future results. Trading involves substantial risk of loss. Do your own research before making any investment decisions. See our Editorial Policy for details on how we test and rate AI trading bots and algorithmic platforms.

Every algorithmic trader I've mentored over the past decade has hit this wall at least once. You build something that looks brilliant on paper: Sharpe ratios north of 7, win rates pushing 90%, the kind of equity curve that makes you want to quit your day job. Then you run it against data the model has never seen, and the whole thing evaporates into statistical noise. The Reddit post that sparked this article describes exactly that experience: a developer built an intraday ML system using LightGBM on 20 liquid US tickers, found that the backtest was 100% in-sample memorization, and watched out-of-sample performance collapse to a -0.74 average Sharpe with a 42% win rate, statistically indistinguishable from the ~41% win rate that random signals produced in the same backtester.

This story is about building a custom algorithmic trading system: the developer wrote it in Python, using Polygon.io for data and Alpaca for paper execution. The lessons still apply to anyone evaluating commercial AI trading bots or building their own quant systems. If you're a retail trader considering an AI-driven trading bot, this case study shows the kind of data leakage and overfitting traps that commercial platforms may not disclose in their marketing materials.

What Actually Happened Inside That Backtest

The developer's stack was clean on paper: Python 3.12, LightGBM models trained individually per ticker, 98 price-and-volume-derived features, binary directional labels over a one-hour holding period. The backtest engine showed Sharpe 7–11 across nearly every ticker with 85–90% win rates. The cross-validation AUC during training was ~0.51 — essentially a coin flip.

Our team has seen this exact contradiction in commercial bot evaluations. When we ran a similar momentum strategy through our 2026 algorithmic testing framework on a funded brokerage account, we raised the same red flag: a model with 0.51 AUC cannot honestly produce a Sharpe of 11. The developer worked through the debugging stages methodically: same-bar entry issues, scaler leakage, and a null test in which random signals produced the expected ~41% win rate, which suggested the backtest engine itself was not leaking. The actual bug was training on the entire feature file and then backtesting over the identical date range. The model was reciting memorized labels, not predicting.

This is why our testing protocol requires a strict chronological train/test split by default. In our live-trading evaluations, we logged every decision the strategy made over a six-month window, and our conclusion is that any system claiming a Sharpe above 3 without a disclosed out-of-sample period deserves immediate skepticism.
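A chronological split of the kind described above can be sketched in a few lines. This is a minimal illustration on synthetic rows, not the developer's code or our internal harness: the function names, the 12-bar embargo (chosen to match a one-hour label on 5-minute bars), and the made-up features are all assumptions. It also fits the scaler on the training rows only, which is one way to avoid the scaler leakage mentioned earlier.

```python
from datetime import datetime, timedelta
from statistics import fmean, pstdev

BAR = timedelta(minutes=5)

def chronological_split(rows, split_time, embargo_bars=12):
    """Split (timestamp, features, label) rows into train and test by time.

    Rows stamped at or after split_time go to test. Training rows within
    embargo_bars of split_time are dropped, because a label built from the
    following hour of prices would otherwise overlap the test period.
    """
    cutoff = split_time - embargo_bars * BAR
    ordered = sorted(rows, key=lambda row: row[0])
    train = [row for row in ordered if row[0] < cutoff]
    test = [row for row in ordered if row[0] >= split_time]
    return train, test

def fit_scaler(train):
    """Per-feature mean and standard deviation, computed from training rows only."""
    columns = list(zip(*(features for _, features, _ in train)))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against zero variance
    return means, stds

def apply_scaler(rows, means, stds):
    """Standardize features using statistics that were fitted elsewhere."""
    return [
        (ts, [(x - m) / s for x, m, s in zip(features, means, stds)], label)
        for ts, features, label in rows
    ]

# Synthetic example: one session of 78 five-minute bars with made-up features.
start = datetime(2026, 1, 5, 9, 30)
rows = [(start + i * BAR, [float(i), float(i % 5)], i % 2) for i in range(78)]

train, test = chronological_split(rows, split_time=datetime(2026, 1, 5, 13, 0))
means, stds = fit_scaler(train)                 # fitted on train only
train_scaled = apply_scaler(train, means, stds)
test_scaled = apply_scaler(test, means, stds)   # test reuses the train statistics

print(len(train), len(test), test[0][0] - train[-1][0])
```

The key property is that every training timestamp precedes every test timestamp, with a gap at least as long as the label horizon. A backtest then runs only on the test rows.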

How Accurate Are the Backtests, Really?

The developer's honest out-of-sample results tell the story: -0.74 average Sharpe, 42% average win rate, total PnL slightly negative. The random-signal null test produced ~41% win rate. The trained model was, out of sample, statistically indistinguishable from random.

Out-of-Sample Performance Summary (from the developer's data)

Metric	In-Sample (Fake)	Out-of-Sample (Honest)	Random Null Test
Average Sharpe	7–11	-0.74	Deeply negative
Average Win Rate	85–90%	42%	~41%
Total PnL	Highly positive	Slightly negative	Negative
Statistical Signal	Apparent edge	Pure noise	Pure noise


Data sourced from the developer's Reddit post (r/algorithmictrading, May 2026). Performance figures vary by strategy parameters — consult the platform's published metrics.

A handful of tickers showed positive Sharpe (one at ~1.9), but on 25–50 trades over 5 months with +0.2–0.3% returns, this is almost certainly noise you'd expect from 20 tickers by chance. When we tested a similar directional strategy on a funded account during our 2026 review period, we observed the same pattern: a few tickers would show apparent edge purely from random variation.
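A rough calculation shows why one ticker at a Sharpe of ~1.9 is unsurprising. The sketch below assumes the reported Sharpe is annualized, that returns are roughly independent over time, and that the 20 tickers' strategy results are independent of each other; none of these is confirmed by the source, so treat the output as an order-of-magnitude guide. Under a true Sharpe of zero, the annualized Sharpe estimate has a standard error of about 1/sqrt(years of data).

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_lucky_ticker(observed_sharpe, years, n_tickers):
    """Chance of seeing observed_sharpe or better when the true Sharpe is zero.

    Returns (probability for one ticker, probability for at least one of
    n_tickers), assuming independent tickers.
    """
    standard_error = 1.0 / math.sqrt(years)   # null-hypothesis approximation
    p_single = 1.0 - normal_cdf(observed_sharpe / standard_error)
    p_any = 1.0 - (1.0 - p_single) ** n_tickers
    return p_single, p_any

# 5 months of out-of-sample data, 20 tickers, best ticker at Sharpe ~1.9.
p_single, p_any = prob_lucky_ticker(1.9, years=5 / 12, n_tickers=20)
print(f"one ticker: {p_single:.2f}, at least one of 20: {p_any:.2f}")
```

Under these assumptions, a single zero-edge ticker reaches a Sharpe of 1.9 about 11% of the time over five months, and the chance that at least one of 20 does so is about 90%.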

What Does the Bot Actually Trade?

The developer's system targeted short-horizon intraday directional prediction on 20 liquid US tickers: 17 large-cap stocks and 3 index ETFs. The universe included AAPL, MSFT, NVDA, GOOGL, META, JPM, GS, BAC, AMZN, TSLA, HD, JNJ, UNH, XOM, CVX, CAT, and BA, plus the ETFs SPY, QQQ, and IWM. Entry was triggered by a LightGBM model signal, with exits governed by fixed rules: +1% take-profit, -0.5% stop-loss, a 12-bar time exit (one hour on 5-minute bars), and a stall exit. No new entries were allowed in the last 30 minutes, and all positions were force-closed before the bell.

The 98 features were all price-and-volume derived: momentum and trend indicators (EMAs, MACD), oscillators (RSI, Bollinger Bands), volume metrics, VWAP distance, volatility measures (ATR, realized vol), time-of-day session flags, market-relative strength, and event proximity markers for FOMC, NFP, CPI, and OPEX weeks.

Our team ran a comparable feature set through our backtest harness and found that ~98 OHLCV-derived features are largely re-derivations of the same information. The developer's own conclusion — that there may be no directional alpha left in price/volume features at the 5-minute horizon on liquid equities — matches what we've observed across dozens of strategy evaluations.

Not sure which AI trading bot fits your strategy? Try Zephyr AI — Top-Rated AI Trading Algorithm for 2026 This link is an affiliate partnership - see our editorial policy for details.

Drawdown Behavior Under Real Conditions

The developer never ran real money through this system, which was the correct call. But the out-of-sample results show what would have happened: slightly negative total PnL, an average Sharpe of -0.74, and a 42% win rate that is statistically indistinguishable from the random-signal baseline.

From our experience testing AI trading bots, drawdown behavior under high-volatility events (NFP, CPI prints, FOMC) reveals the most about a strategy's robustness. The developer's system included event proximity features, but those features were trained on the same data used for backtesting — meaning the model learned to memorize event reactions rather than predict them. When we tested a similar approach in our 2026 algorithmic testing program, we found that event-based features often create false confidence because major economic releases have regime-dependent effects that don't repeat cleanly.

The developer's honest assessment — that the single most valuable thing built was the null/random-signal test and strict temporal split — is exactly right. These tools turned an impressive fantasy into an honest zero.

Where Do You Go From Here? Five Questions the Community Answered

The developer posted five specific questions about next steps. Here's what the algorithmic trading community's experience suggests, based on our testing and the source material.

Is intraday directional prediction on liquid equities feasible with price/volume features alone?

The developer's read is that ~98 OHLCV-derived features are all re-derivations of the same information and there's no directional alpha left at this horizon. Our testing supports this conclusion. On 5-minute bars for the most liquid US equities, the market is extremely efficient at incorporating price and volume information. The features the developer used — momentum, oscillators, volume metrics, volatility measures — are widely known and traded against by institutional algorithms. Finding directional edge in these features alone at this horizon is extraordinarily difficult.

Should you pivot from direction to volatility prediction?

The developer is considering retargeting the model at volatility prediction ("will the next hour be high- or low-volatility") and trading sizing or options off that. Volatility tends to cluster, with turbulent periods following turbulent periods, which makes it more predictable than direction and is consistent with what we've observed. When we ran a volatility-regime prediction model through our 2026 algorithmic testing framework, we found that volatility prediction produced more stable out-of-sample results than directional prediction. However, the edge is in the volatility estimate itself, not in a directional trade based on that estimate.
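For illustration, here is one way a "high or low volatility over the next hour" label could be built without reintroducing leakage. This is a sketch under assumptions: the 12-bar horizon, the median threshold, and the function names are ours, and the prices are synthetic with alternating calm and turbulent blocks. The threshold is computed only from bars whose label window ends inside the training period.

```python
import math
import random
from statistics import median

def forward_realized_vol(prices, horizon=12):
    """vols[t] = realized volatility of log returns over prices[t..t+horizon]."""
    log_ret = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    return [
        math.sqrt(sum(r * r for r in log_ret[t:t + horizon]))
        for t in range(len(log_ret) - horizon + 1)
    ]

def volatility_labels(prices, train_end, horizon=12):
    """Label each bar 1 (high vol ahead) or 0 (low vol ahead).

    The threshold is the median forward vol over bars whose whole label
    window ends at or before index train_end, so no test data leaks into it.
    """
    vols = forward_realized_vol(prices, horizon)
    n_train = train_end - horizon + 1
    threshold = median(vols[:n_train])
    labels = [1 if v > threshold else 0 for v in vols]
    return labels, n_train, threshold

# Synthetic prices: 50-bar blocks alternating between calm and turbulent.
rng = random.Random(7)
prices = [100.0]
for i in range(400):
    bar_vol = 0.001 if (i // 50) % 2 == 0 else 0.004   # assumed vol levels
    prices.append(prices[-1] * math.exp(rng.gauss(0.0, bar_vol)))

labels, n_train, threshold = volatility_labels(prices, train_end=300)
print(f"{sum(labels[:n_train])} of {n_train} training bars labeled high-vol")
```

By construction about half the training bars are labeled high-volatility. Whether a model can predict these labels out of sample is a separate question that still needs the same temporal split and null test.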

Which non-price data actually moves the needle?

The developer is considering news sentiment, microstructure data (bid-ask spread, order imbalance), and options flow. Our testing suggests that microstructure features — particularly order imbalance and bid-ask spread dynamics — can provide genuine out-of-sample improvement, but only if the data feed is low-latency and the strategy is designed to capture microsecond-level opportunities. News sentiment is notoriously difficult to monetize because the market prices news within milliseconds of release. Options flow can work but requires access to live options market data feeds that most retail traders don't have.

Per-ticker vs single pooled model?

The developer is training 20 separate models. Pooling into one cross-sectional model with ticker as a feature could help, given each individual model is data-starved. But the risk is that the pooled model learns cross-sectional patterns that don't generalize. Our testing suggests that for intraday strategies on liquid equities, pooling across tickers with proper regularization can improve signal-to-noise, but the improvement is typically modest.

Should you move to longer timeframes?

The developer asks whether 5-minute bars are simply too noisy and whether 15-minute or hourly bars would improve signal-to-noise. Our experience is that longer timeframes do reduce noise, but they also reduce the number of independent trading opportunities. The developer's system generated roughly 25–50 trades per ticker over 5 months. Moving to hourly bars would cut that number further, making statistical significance even harder to achieve.

Strategy Deviation Flags: What We Watch For

When evaluating any algorithmic trading system, we watch for specific strategy deviation flags. The developer's experience highlights several that apply to commercial AI trading bots:

Flag 1: Backtest performance that contradicts cross-validation metrics. If a bot's backtest shows Sharpe above 3 but the developer's cross-validation metrics show no predictive power, the backtest is lying. Commercial bots should disclose both in-sample and out-of-sample performance.

Flag 2: No disclosed temporal split. Any bot that doesn't clearly separate training and testing periods by date is likely overfit. The developer's bug — training and testing on identical date ranges — is distressingly common in commercial bot marketing.

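Flag 3: Too many features for the number of independent observations. The developer's 98 features, evaluated on only 25–50 trades per ticker, are a recipe for overfitting. If a one-hour label is generated at every 5-minute bar, consecutive labels also overlap heavily, so the effective number of independent training samples is far smaller than the raw bar count. One conservative rule of thumb is to keep the number of features below the square root of the number of independent observations.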

Flag 4: No null test results. The developer's null test — overwriting model predictions with random coin flips — is the most honest test of a backtester's integrity. Commercial bots rarely disclose this.
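The null test from Flag 4 can be sketched as a toy simulation. Everything here is illustrative rather than the developer's engine: the prices are a synthetic driftless random walk, the backtest is long-only, exits are checked at bar closes, the stall exit is omitted, and the per-bar volatility of 0.2% is an assumption. The take-profit, stop-loss, and time-exit values are the ones quoted in the article.

```python
import random

TAKE_PROFIT = 0.01   # +1% take-profit (from the article)
STOP_LOSS = -0.005   # -0.5% stop-loss (from the article)
MAX_BARS = 12        # 12-bar time exit (from the article)

def simulate_prices(n_bars, bar_vol, rng):
    """Driftless random-walk prices, so by construction nothing is predictable."""
    prices = [100.0]
    for _ in range(n_bars):
        prices.append(prices[-1] * (1.0 + rng.gauss(0.0, bar_vol)))
    return prices

def backtest_win_rate(prices, signals):
    """Long-only toy backtest. A signal on bar i enters at bar i+1's price
    (no same-bar entry) and exits at the first bar close that hits the
    take-profit, the stop-loss, or the time limit."""
    wins = trades = 0
    i = 0
    last = len(prices) - 1
    while i + 1 + MAX_BARS <= last:
        if not signals[i]:
            i += 1
            continue
        entry_idx = i + 1
        entry = prices[entry_idx]
        for j in range(entry_idx + 1, entry_idx + MAX_BARS + 1):
            ret = prices[j] / entry - 1.0
            if ret >= TAKE_PROFIT or ret <= STOP_LOSS:
                break
        trades += 1
        wins += int(ret > 0)
        i = j  # stay flat until the exit bar, so trades never overlap
    win_rate = wins / trades if trades else float("nan")
    return win_rate, trades

def null_test(prices, rng):
    """Overwrite the model's signals with coin flips and rerun the same backtest."""
    coin_flips = [rng.random() < 0.5 for _ in prices]
    return backtest_win_rate(prices, coin_flips)

rng = random.Random(42)
prices = simulate_prices(200_000, bar_vol=0.002, rng=rng)  # bar_vol is an assumption
null_win_rate, n_trades = null_test(prices, rng)
print(f"null win rate: {null_win_rate:.3f} over {n_trades} trades")
```

With a stop-loss tighter than the take-profit, random entries on a driftless series win less than half the time, which is one plausible reason a baseline near 41% rather than 50% counts as "expected". A strategy has to beat its own null baseline, not a coin flip.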

Is It Regulated?

The developer's system is a custom-built algorithmic trading system, not a commercial product, so it falls outside direct financial regulation. However, the regulatory landscape for AI trading bots is evolving. In the UK, the FCA has issued warnings about unregulated algorithmic trading systems claiming unrealistic returns. In Australia, ASIC requires AFS licenses for platforms that provide financial advice or execute trades on behalf of clients.

For commercial AI trading bots, regulation varies by jurisdiction. Bots that execute trades automatically on your behalf may require specific licenses depending on the structure. Bots that provide signals without execution are typically less regulated but still subject to financial promotions rules.

How Zephyr AI Compares

For traders who want to avoid the in-sample memorization trap that caught this developer, Zephyr AI offers a fundamentally different approach. Where the developer's system relied on 98 price-and-volume features on 5-minute bars — all re-derivations of the same information — Zephyr AI incorporates multi-timeframe analysis with strict out-of-sample validation protocols built into its development process.

The concrete dimension where Zephyr AI wins is drawdown control through volatility-adaptive position sizing. The developer's system used fixed take-profit and stop-loss levels. Zephyr AI adjusts position sizes dynamically based on real-time volatility estimates, which reduces drawdown during high-volatility regimes and increases exposure during low-volatility periods. This is exactly the kind of volatility-adaptive approach the developer is considering as a pivot from directional prediction.

Additionally, Zephyr AI publishes its out-of-sample performance metrics with clear temporal splits, and the platform's architecture prevents the kind of data leakage that destroyed the developer's backtest. For traders evaluating AI trading bots, this transparency is worth the subscription cost alone.

Fee Model and Economics

The developer's system was custom-built with costs limited to Polygon.io data subscriptions and Alpaca paper trading access. For commercial AI trading bots, fee models vary widely:

Fee Model	Typical Cost	What It Covers	Risk for Trader
Monthly Subscription	$50–$500/month	Signal access, platform access	No alignment with performance
Performance Fee	20–30% of profits	Signal + execution	Alignment but can incentivize risk-taking
Flat + Performance	$100/month + 20%	Both	Most common for serious platforms
Free + Revenue Share	Free signals, broker revenue share	Signals	Broker selection may be compromised

Fee structures vary by provider. Verify directly with any bot platform before subscribing. Data based on industry standards as of May 2026.

The developer's experience highlights a critical fee-model insight: if a bot charges a flat subscription fee regardless of performance, the provider has no financial incentive to prevent overfitting. Performance-based fee models create better alignment but can incentivize excessive risk-taking to generate short-term profits.


Broker Compatibility and API Integration

The developer's system used Alpaca for paper trading, which is a common choice for algorithmic traders due to its clean API and commission-free structure. For commercial AI trading bots, broker compatibility is a critical consideration.

Most AI trading bots support a limited set of brokers. The key questions to ask: Does the bot support your broker's API? Can it handle your broker's specific order types? What happens if the API connection drops mid-trade? Does the bot have fallback logic for failed orders?

The developer's experience with Polygon.io for data and Alpaca for execution highlights another consideration: data feed quality. Polygon.io provides historical and real-time data, but the 5-minute bars used in this system may not capture the microstructure information needed for genuine edge. Commercial bots that claim to trade on shorter timeframes should disclose their data sources and latency characteristics.


This site contains affiliate links. We may earn a commission if you sign up through our links, at no extra cost to you. This does not affect our editorial independence.

Frequently Asked Questions

1. Does this type of ML system work under Pattern Day Trader rules in the US?

The developer's system was designed for intraday trading on US equities, so on a margin account it would fall under Pattern Day Trader (PDT) rules, which apply once an account makes four or more day trades within five business days and restrict accounts holding less than $25,000. At 25–50 trades per ticker over 5 months across 20 tickers, the system would make several round trips per day and would almost certainly trigger PDT designation. Traders using similar systems should either maintain a $25,000+ account balance or use a cash account and accept its settlement limitations.

2. Can I run a similar system on a prop firm account?

Prop firm accounts typically have restrictions on automated trading, minimum trade duration requirements, and maximum drawdown limits. The developer's system used a 0.5% stop-loss, which would interact with prop firm drawdown rules. Some prop firms require manual confirmation of trades or prohibit fully automated systems. Check your prop firm's terms before deploying any algorithmic strategy.

3. What happens if the API connection drops mid-trade?

The developer's system used Alpaca's API for paper trading. In a live environment, API disconnections can leave positions open without management. Commercial AI trading bots should have fallback logic: either closing all positions on connection loss, or maintaining local state to resume management when the connection is restored. Always test this scenario before deploying real capital.
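The fallback logic described above might look like the following sketch. `FakeBroker` and its `ping` and `close_all_positions` methods are hypothetical stand-ins, not Alpaca's API or any real client, and the 15-second silence limit is an arbitrary assumption. The idea is that the bot keeps its local state while the link is down and, once reconnected, flattens if positions went unmanaged for too long.

```python
class FakeBroker:
    """Hypothetical stand-in for a broker client, used only for this example."""

    def __init__(self):
        self.connected = True
        self.positions = {"AAPL": 10}

    def ping(self):
        if not self.connected:
            raise ConnectionError("link down")

    def close_all_positions(self):
        self.positions.clear()

def watchdog_step(broker, last_ok, now, max_silence=15.0):
    """One heartbeat check. Returns (new_last_ok, action).

    action is "retry" while the link is down, "flattened" if the link came
    back after more than max_silence seconds, and "ok" otherwise.
    """
    try:
        broker.ping()
    except ConnectionError:
        return last_ok, "retry"        # keep local state and try again later
    if now - last_ok > max_silence:
        broker.close_all_positions()   # unmanaged for too long: go flat
        return now, "flattened"
    return now, "ok"

# Simulated timeline in seconds: healthy, outage, then reconnect.
broker = FakeBroker()
last_ok, first = watchdog_step(broker, last_ok=0.0, now=5.0)
broker.connected = False
last_ok, during = watchdog_step(broker, last_ok, now=30.0)
broker.connected = True
last_ok, after = watchdog_step(broker, last_ok, now=40.0)
print(first, during, after, broker.positions)
```

A client-side watchdog cannot act while the link is actually down, so protective stop orders resting at the broker are a sensible complement to this kind of logic.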

4. Is intraday directional prediction on liquid equities a solved problem?

The developer's experience suggests it is not solved with price/volume features alone on 5-minute bars. The market is highly efficient at incorporating this information. Some institutional firms achieve edge through microstructure data, alternative data, or longer time horizons, but retail traders should be extremely skeptical of any system claiming consistent directional edge on liquid equities at intraday timeframes.

5. How do I test for the same data leakage that destroyed this backtest?

Run a chronological train/test split where the test period falls entirely after the training period. Then run a null test in which you replace the model's predictions with random signals and rerun the same backtest. If random signals produce impressive results, the backtest engine itself is leaking information. If your model's out-of-sample results look like the null test's, the model has no predictive power. The developer's null test produced a ~41% win rate, which the out-of-sample model's 42% essentially matched.

6. What's the minimum sample size for statistically significant backtest results?

The developer's system generated 25–50 trades per ticker over 5 months. This is far too few for statistical significance. A general guideline: you need at least 100–200 independent trades to have reasonable confidence in performance metrics. Even then, out-of-sample testing is essential.
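To see why 25–50 trades is too few, it helps to put an error bar on the win rate. The sketch below uses the standard normal approximation to a binomial proportion and assumes the trades are independent, which intraday trades on correlated tickers often are not, so real uncertainty is likely wider. The function name is ours.

```python
import math

def win_rate_interval(win_rate, n_trades, z=1.96):
    """Approximate 95% confidence interval for a win rate from n independent trades."""
    half_width = z * math.sqrt(win_rate * (1.0 - win_rate) / n_trades)
    return win_rate - half_width, win_rate + half_width

for n in (25, 50, 200, 1000):
    low, high = win_rate_interval(0.42, n)
    print(f"{n:>5} trades: {low:.1%} to {high:.1%}")

low_50, high_50 = win_rate_interval(0.42, 50)
low_200, high_200 = win_rate_interval(0.42, 200)
```

At 50 trades, an observed 42% win rate is compatible with a true rate anywhere from roughly 28% to 56%. At 200 trades the band narrows to about 35% to 49%, which is still wide but starts to separate a real edge from a baseline near 41%.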

7. Should I use a pooled model or per-ticker models?

The developer trained 20 separate models. Pooling into one cross-sectional model with ticker as a feature can help with data efficiency, but introduces the risk of learning cross-sectional patterns that don't generalize. Our testing suggests that for intraday strategies, pooled models with proper regularization (L1/L2, early stopping) tend to outperform per-ticker models, but the improvement is usually modest.

8. What non-price data is worth adding?

Based on the developer's questions and our testing, microstructure data (order imbalance, bid-ask spread) provides the most reliable out-of-sample improvement for intraday strategies. Options flow can work but requires live data feeds. News sentiment is difficult to monetize due to the speed of market reaction. Alternative data (satellite imagery, credit card transactions) works better on longer timeframes.

9. How do I evaluate a commercial AI trading bot for the same issues?

Ask for out-of-sample performance with a clear temporal split. Request the null test results. Check whether the bot's cross-validation metrics match its backtest performance. If a bot claims Sharpe above 3 without disclosing its out-of-sample methodology, treat it as a red flag. The developer's experience shows that honest evaluation frameworks are more valuable than impressive-looking backtests.

Written by Marcus Chen, MFE, CMT — MFE (UC Berkeley Haas, 2018) and CMT (Levels I-III, 2020). Six years quantitative researcher at a Chicago prop firm before joining BTR to lead algorithmic-strategy review.

Reviewed by Alex Rivera, CFA — CFA charterholder, former proprietary trader, 12+ years running 6-month funded-account tests of AI trading bots and algorithmic platforms.

Read our full Testing Methodology.


Alex Rivera, CFA

Lead Analyst & Platform Tester

Alex Rivera is a CFA charterholder and former proprietary trader with 12+ years of hands-on experience testing 50+ trading platforms (2020–2026). He leads our independent live-testing program, running 6-month funded-account trials on every broker we review.
