Sai Swapna Gollapudi

Who I Am

About Me

I'm a Data Scientist at Amazon, where I build machine learning systems that operate at the intersection of scale and precision. My work spans personalized learning platforms, fraud detection, search relevance, and LLM evaluation — each touching millions of users globally.

With a Master's in Data Science from Indiana University Bloomington and nearly a decade of industry experience, I bring both theoretical rigor and hands-on engineering depth to every problem I tackle.

My research on Abuse Detection was accepted and presented at the Amazon Machine Learning Conference (AMLC) 2025 — a recognition of the novel multi-layer ML architecture I designed for enterprise-scale fraud pattern detection.

Outside of model building, I'm passionate about translating complex ML outputs into clear, actionable insights for business stakeholders — bridging the gap between research and real-world impact.

🧠 Large Language Models

Designing LLM-driven pipelines for skill extraction, feature engineering, code generation, and evaluation at production scale using Claude, GPT, and open-source models.

⚙️ ML Systems Engineering

End-to-end model development on AWS SageMaker — from experimentation on TB-scale datasets to productionized, monitored, and scalable deployments.

🔍 Search & Recommendations

Built semantic search and learning-to-rank systems replacing legacy BM25 models, with measurable NDCG improvements validated through stakeholder-driven A/B tests.

📊 Anomaly & Fraud Detection

Designed multi-layer unsupervised ML architectures using Isolation Forest, Autoencoders, and DBSCAN to detect emerging fraud patterns across high-volume enterprise data.

Career

Work Experience

Amazon

Data Scientist

March 2022 — Present

Personalized Learning Experience (Learn)

Designed foundational ML architecture for a personalized learning platform processing 3M+ training documents
Built LLM-driven skill extraction pipeline using prompt engineering on unstructured training content
Implemented BERT/RoBERTa embedding-based skill normalization with K-Means and HDBSCAN clustering
Productionized models on AWS SageMaker with scalability, reproducibility, and monitoring

Python, PyTorch, BERT, RoBERTa, SageMaker, K-Means, HDBSCAN, Claude Sonnet

⭐ AMLC 2025 — Accepted & Presented

Abuse Detection

Led end-to-end experimentation on TB-scale datasets, including transformations to handle missing and inconsistent data
Proposed multi-layer ML architecture: unsupervised anomaly detection + consensus model to reduce false positives
Built LLM pipeline to extract structured features from policy documents under zero-shot and few-shot constraints
Designed LLM-assisted Python code generation with automated correction for production migration

Isolation Forest, Autoencoders, DBSCAN, One-Class SVM, SageMaker, Lambda, Glue, spaCy

Effortless Resolution Indicator

Developed effort estimation framework analyzing 2M+ monthly employee contacts across multiple channels
Created composite effort score combining operational metrics with ML-extracted behavioral and linguistic signals
Applied PCA, t-SNE, and clustering to identify effort archetypes and explain score drivers

PCA, t-SNE, Autoencoders, NLP, Prompt Engineering, SQL

LLM Evaluation Pipeline

Developed evaluation pipeline for globally deployed chatbot processing 50,000+ query-response pairs daily
Reduced processing time from 30 minutes to 30 seconds per 100 messages via parallel processing optimization
Benchmarked against AWS Bedrock Guardrails for hallucination detection and retrieval quality

SageMaker, SQS, Lambda, Bedrock API, Glue, QuickSight

Search Engine Development & Topic Modeling

Replaced BM25 model with learning-to-rank algorithm, demonstrating NDCG improvements within two weeks
Designed A/B testing framework to measure search impact and refine ranking strategies
Developed BERTopic + LLM taxonomy generation system for document classification

BERTopic, Learning-to-Rank, A/B Testing, Claude

Capital One

Senior Data Scientist

May 2021 — March 2022

Identity Theft Detection

Built ML models to identify identity theft in the customer login process using CNN, Random Forest, XGBoost, and LSTM
Built end-to-end data pipelines with feature engineering and statistical analysis
Developed Tableau dashboards to monitor model performance and evaluate metrics

CNN, XGBoost, LSTM, Python, SQL, Tableau

Capital One

Data Science Intern

June 2020 — Aug 2020

Digital Footprint Pattern Classification

Built model to identify and classify patterns in customers' digital footprints using clickstream data
Applied LSTM/GRU, association rule mining, and convolutional neural networks

LSTM, GRU, CNN, Association Rule Mining

Tata Consultancy Services

Data Analyst

July 2016 — July 2019

Predictive Classification & Financial Analytics

Built predictive classification models for retail order return analysis using Logistic Regression, Decision Trees, and Random Forest, with SMOTE to handle class imbalance
Built risk classification model for Accounts Receivable
Developed automated financial analysis dashboards using RStudio, Spark, SQL, and Hadoop

Python, Sklearn, SMOTE, RStudio, Spark, Hadoop, SQL

Get In Touch

Contact

Let's connect.

Whether you're interested in collaboration, research opportunities, or just want to talk AI and ML — feel free to reach out. I'm always open to meaningful conversations.

LinkedIn Profile ↗

Open to Opportunities

I'm currently based in the US and open to Data Scientist, ML Engineer, and AI Research roles — particularly those focused on LLMs, NLP systems, and large-scale ML infrastructure.

⭐ AMLC 2025 presenter — Abuse Detection with multi-layer ML architecture