About

This page gives a detailed overview of my past work.

💼 Work Experience

Wolters Kluwer (Jan 2023 - Present) - Machine Learning Engineer

  • Borrower Analytics 2.0 (UCC filings)
    • Processed 65M UCC filings for information extraction from PDFs (>1B single-page images).
    • Fine-tuned and deployed a vision-language model to extract 50+ fields from UCC1/UCC3 across diverse state formats; achieved 95% extraction accuracy.
    • Deployed with a high-throughput serving stack at ~10 images/sec on H100 GPUs.
    • Built an LLM-as-a-judge framework to generate high-quality labeled data, significantly reducing human labeling effort.
    • Trained and deployed a transformer for text segmentation (97% recall) extracting complete collateral text.
    • Trained a lien-classification model (95% accuracy) based on collateral text.
  • IRA Knowledge-Base Chatbot (RAG)
    • Built a RAG chatbot over an XML-based IRA knowledge base for faster query resolution.
    • Implemented query transformation, parent–child retrieval, hierarchical indexing, and multi-vector retrieval in LanceDB.
    • Improved retrieval with ColBERT reranking and hybrid search (semantic + BM25).
    • Reached ~97% CSAT, improving user experience and response accuracy.
  • WK AI Studio (internal fine-tuning platform)
    • Built an end-to-end internal framework to fine-tune and deploy models to production.
    • Enabled non-technical teams to fine-tune models with their own data.
    • Integrated MLflow with full lineage, model registry, and dataset/model versioning for production tracking.
    • Supported distributed multi-GPU, 4-/8-bit training, LoRA/QLoRA, mixed precision, and differential learning rates, plus other SOTA techniques.
  • BLMS (Business License Match/Search)
    • Built a search engine to recommend required licenses for starting a business.
    • Developed a HyDE (Hypothetical Document Embeddings) workflow to expand short user queries into meaningful search text.
    • Used synthetically generated descriptions from an internal taxonomy and a reranker, achieving 96% retrieval accuracy.
  • Proviso (legal citations chatbot)
    • Built a RAG-based chatbot on 91k legal citations for efficient legal query resolution.
    • Enhanced retrieval with metadata filtering, query transformation, and techniques similar to those used for the IRA chatbot.
    • Used BERTopic to generate hypothetical topics, which served as metadata filters to reduce the search space and improve query handling.
  • Earlier - Data Science Intern (Jan–Jul 2023)
    • Built an in-house key information extraction solution using a transformer-based stack.
    • Shipped a document classification pipeline.
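
The hybrid search (semantic + BM25) mentioned for the IRA chatbot needs some way to merge two ranked lists. The page does not say which fusion method was used, so the following is only a minimal sketch assuming reciprocal rank fusion; the function name, the document IDs, and the constant `k=60` are illustrative assumptions, not the production setup.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default (assumption).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from the two retrievers for one query.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is the usual motivation for combining lexical and semantic retrieval before a reranking step.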

Weights & Biases (May 2022 - Present) - Ambassador

  • Engineered optimized Kaggle notebooks with integrated W&B tracking and monitoring.
  • Authored technical reports showcasing W&B across medical imaging, visual-language models (Flamingo, BLIP-2), few-shot learning (SetFit), PyTorch 2.0, Hugging Face, RAG, and LLMs.
  • You can find all my W&B blogs here

📈 Competitions

I love participating in machine learning competitions. I am primarily active on Kaggle and am a Kaggle Competition Expert.

WSDM Cup - Multilingual Chatbot Arena (Kaggle, 2024) - 31/950 🥈
  • The challenge was to develop a reward model (as used in the RLHF stage) for multilingual human conversations on Chatbot Arena (formerly LMSYS).
  • Fine-tuned LLMs as reward models in a classification setting, using multi-stage training (pretraining, then fine-tuning), pseudo-labelling, LoRA/QLoRA, efficient inference techniques, and knowledge distillation.
  • Competition Link | Code
Kaggle - LLM Science Exam (2023) - 123/2664 🥈
  • Multiple-choice science questions; evaluated by MAP@3.
  • Dual retrieval: TF-IDF over curated Wikipedia corpora + dense retrieval (BGE-small-en v1.5) with FAISS.
  • Context assembly: join top-K passages into a single context string per question.
  • Answering model: DeBERTa-v3-Large as multiple-choice scorer (context + question paired with each option).
  • Ensembling: soft-average logits across checkpoints × retrieval variants for the final MAP@3 submission.
  • Competition Link
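
The ensembling and scoring steps above can be sketched roughly as follows. This is a simplified illustration, assuming five options labeled A to E and one correct answer per question; the helper names and the example scores are made up for the sketch and are not taken from the competition code.

```python
def average_scores(per_model_scores):
    """Element-wise mean of per-option scores from several models."""
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]


def top3(scores, options="ABCDE"):
    """Return the three option labels with the highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [options[i] for i in order[:3]]


def map_at_3(predictions, answers):
    """MAP@3 with exactly one correct label per question.

    A question contributes 1/rank if the answer is among the first three
    guesses (rank starting at 1), and 0 otherwise.
    """
    total = 0.0
    for ranked, truth in zip(predictions, answers):
        for rank, guess in enumerate(ranked[:3], start=1):
            if guess == truth:
                total += 1.0 / rank
                break
    return total / len(answers)


# Hypothetical option scores from two checkpoints for a single question.
model_1 = [2.0, 0.5, 0.1, -1.0, 0.0]
model_2 = [0.0, 2.0, 0.3, -0.5, 0.2]
ranked = top3(average_scores([model_1, model_2]))
score = map_at_3([ranked], ["A"])
print(ranked, score)
```

Because the metric rewards a correct answer in second or third place, averaging scores before ranking can help even when no single checkpoint puts the right option first.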
DataSolve-India (Wolters Kluwer, 2022) - 1st place 🥇
  • The objective was to categorize regulations that are crucial for business compliance (multi-label classification).
  • Used a weighted hill-climbing ensemble of transformer models (DeBERTa-v3, RoBERTa) and GBDTs (CatBoost, XGBoost).
  • Competition Link | Code
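
For readers unfamiliar with hill-climbing ensembles, here is a minimal sketch of the general idea: greedy selection with replacement on validation predictions, where the blend weights fall out of how often each model is picked. It uses mean squared error as a stand-in metric and toy predictions; the actual competition metric, models, and data are not reproduced here, and all names are illustrative.

```python
def mse(pred, target):
    """Mean squared error, used here as a stand-in validation metric."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def hill_climb_weights(model_preds, target, max_steps=50):
    """Greedy ensemble selection with replacement.

    model_preds: non-empty dict of model name -> validation predictions.
    Returns (weights, loss), with weights proportional to pick counts.
    """
    counts = {name: 0 for name in model_preds}
    running_sum = [0.0] * len(target)  # sum of predictions of all picks
    picks = 0
    best_loss = float("inf")
    for _ in range(max_steps):
        best_name = None
        for name, preds in model_preds.items():
            blended = [(s + p) / (picks + 1) for s, p in zip(running_sum, preds)]
            loss = mse(blended, target)
            if loss < best_loss:
                best_loss, best_name = loss, name
        if best_name is None:  # no candidate improves the blend: stop
            break
        counts[best_name] += 1
        picks += 1
        running_sum = [s + p for s, p in zip(running_sum, model_preds[best_name])]
    return {name: c / picks for name, c in counts.items()}, best_loss


# Toy validation predictions from two hypothetical models.
target = [1.0, 0.0, 1.0, 0.0]
val_preds = {
    "model_a": [0.9, 0.2, 0.6, 0.1],
    "model_b": [0.6, 0.1, 0.9, 0.3],
}
weights, loss = hill_climb_weights(val_preds, target)
print(weights, loss)
```

Since the first pick is always the best single model, the final blend can never score worse on the validation set than that model alone; the usual caveat is that the weights can overfit a small validation set.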
U.S. Patent Phrase-to-Phrase Matching (Kaggle, 2022) - 31/1889 🥈
  • The task was to extract relevant information by determining the semantic similarity between key phrases in patent documents.
  • Used a hill-climbing ensemble over a range of transformer models trained with different strategies to ensure diversity.
  • Competition Link | Code
Happywhale - Whale & Dolphin Identification (Kaggle, 2022) - 132/1588 🥉
  • Individual re-identification using dorsal fin/marking signatures.
  • Curated and published resized training sets to speed up iteration and stabilize training.
  • Built a robust visual re-ID pipeline (embedding model + nearest-neighbour matching), with heavy augmentation and careful per-individual CV to avoid leakage.
  • Iterated on mining schedules and validation sanity checks for consistent generalization.
  • Competition Link
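
The matching half of such a re-ID pipeline can be illustrated with a small sketch: given an embedding for a query image, rank known individuals by cosine similarity to their gallery embeddings. The two-dimensional vectors and the individual IDs below are toy values chosen for illustration, and the embedding model itself is assumed to exist elsewhere.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_matches(query, gallery, k=5):
    """Rank distinct individuals by their best-matching gallery embedding.

    gallery: list of (individual_id, embedding) pairs; an individual may
    appear several times, and only its highest similarity is kept.
    """
    best = {}
    for individual_id, embedding in gallery:
        sim = cosine(query, embedding)
        if sim > best.get(individual_id, float("-inf")):
            best[individual_id] = sim
    return sorted(best, key=lambda i: -best[i])[:k]


# Toy gallery with hypothetical IDs and 2-D embeddings.
gallery = [
    ("whale_1", [1.0, 0.0]),
    ("whale_1", [0.9, 0.1]),
    ("whale_2", [0.0, 1.0]),
    ("dolphin_7", [0.7, 0.7]),
]
matches = top_matches([0.8, 0.2], gallery, k=2)
print(matches)
```

In practice the gallery would hold embeddings of the training images, and a similarity threshold could be used to flag a query as a previously unseen individual.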
Amazon ML Challenge (HackerEarth, 2021) β€” 11/3294
  • The task was to categorize products into browse node IDs on a large dataset (2.67 GB of text, 9k+ classes).
  • Combined standard handcrafted features, sentence embeddings, and TF-IDF, using a custom neural network to merge all features.
  • Competition Link | Code (team)
Bristol-Myers Squibb - Molecular Translation (Kaggle, 2021) - 50/874 🥈
  • The task was to interpret old chemical images and convert them back to the underlying chemical structure, annotated as InChI text.
  • Used a Vision Transformer (ViT) as the encoder and the original Transformer decoder.
  • Generated 12M synthetic images with RDKit for better ViT performance.
  • Competition Link
Sartorius - Cell Instance Segmentation (Kaggle, 2021) - 117/1505 🥉
  • The task was to detect and delineate single neuronal cells in microscopy images (instance segmentation).
  • Implemented a Detectron2-based Mask R-CNN pipeline (with tuned anchors/thresholds) as the primary model.
  • Added a parallel Cellpose track for comparison; tracked results and failures to guide post-processing.
  • Applied TTA (flips/scales) and morphological post-processing (small-object cleanup, hole-filling) to refine masks.
  • Deployed a small demo app for qualitative review and error analysis.
  • Competition Link | Code

🌍 Open-Source Contributions


🎀 Talks

I frequently give talks on machine learning and MLOps topics. Most of my talks to colleges and small groups are not recorded, but here is one notable recorded presentation:

I'm open to giving talks! If you're interested in having me speak at your event, please reach out to me at atharvaaingle@gmail.com.