Summary
Developed a custom data collection service to gain deeper insight into website traffic patterns and visitor identity that Google Analytics did not expose. The system uses serverless architecture on Google Cloud to capture detailed HTTP request data, including headers, IP addresses, and timestamps. By generating unique user identifiers and organizing the data into partitioned files, the pipeline integrates with analytics platforms such as BigQuery and Looker. I also implemented data normalization and enrichment to improve the quality and usability of the collected data for downstream analysis.
What I did
Serverless Data Collection: Built a Google Cloud Function to capture HTTP request data and store it as JSON files in Google Cloud Storage.
User Identification: Implemented unique userId cookies for tracking individual users across sessions.
Data Processing Pipeline: Created a secondary Cloud Function to iterate through the event JSON files, normalize the data, handle errors, and merge the events into hourly files laid out with Hive-style partitioning.
Database Integration: Configured a BigQuery external table to access processed data without incurring additional storage costs, enabling efficient querying and analysis.
Analytics Integration: Connected the BigQuery table to Power BI to generate actionable insights and visualizations.
Data Enrichment: Developed an enrichment service using IPinfo to append geographic and network information to the collected IP addresses for enhanced analytics.
Scalable Architecture: Designed the pipeline to be scalable and maintainable, leveraging Google Cloud’s serverless offerings for reliability and performance.
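The collection and merge steps above can be sketched roughly as follows. This is a minimal illustration rather than the deployed code. The function names, the `dt=`/`hour=` partition keys, the `events.json` file name, and the newline-delimited JSON output are all assumptions made for the example, and the Cloud Functions and Cloud Storage calls are left out so the logic can run on its own.

```python
import json
import uuid
from collections import defaultdict
from datetime import datetime, timezone
from http.cookies import SimpleCookie


def get_or_create_user_id(cookie_header):
    """Return (user_id, is_new) based on a raw Cookie request header."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    if "userId" in cookie and cookie["userId"].value:
        return cookie["userId"].value, False
    # No userId cookie yet: mint a new random identifier.
    return uuid.uuid4().hex, True


def hourly_object_path(ts, prefix="events"):
    """Build a Hive-style partitioned object path for the hour containing ts."""
    if ts.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    ts = ts.astimezone(timezone.utc)
    return f"{prefix}/dt={ts:%Y-%m-%d}/hour={ts:%H}/events.json"


def merge_events(raw_events, prefix="events"):
    """Group raw event JSON strings into hourly newline-delimited JSON payloads.

    Returns (files, errors): files maps object path to file content, and
    errors lists a message for each event that could not be parsed.
    """
    merged = defaultdict(list)
    errors = []
    for raw in raw_events:
        try:
            event = json.loads(raw)
            ts = datetime.fromisoformat(event["timestamp"])
        except (ValueError, KeyError, TypeError) as exc:
            errors.append(f"{type(exc).__name__}: {exc}")
            continue
        path = hourly_object_path(ts, prefix)
        merged[path].append(json.dumps(event, sort_keys=True))
    files = {path: "\n".join(lines) for path, lines in merged.items()}
    return files, errors
```

With a layout like this, a BigQuery external table can treat `dt` and `hour` as partition columns, so queries that filter on them read only the matching files.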
Summary
I implemented a Retrieval-Augmented Generation (RAG) system that generates AI responses grounded in a large volume of Yelp reviews. It combines text embeddings, Google Firestore, Google Cloud Storage, Google Cloud Functions, and OpenAI.
What I did
Created text embeddings of the Yelp reviews using Sentence-BERT (S-BERT).
Stored embeddings in Google Cloud Storage.
Stored review metadata in Google Firestore for easy querying.
Created an API endpoint that processes user input from Google Sheets and generates AI responses.
The endpoint loads the embeddings, filters the metadata, and calls OpenAI to generate the response.
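The retrieval step inside the endpoint can be illustrated with a small sketch. It assumes the embeddings and metadata are already loaded into memory as plain Python lists and dicts. The `stars` filter, the field names, and the prompt wording are hypothetical, and the S-BERT encoding, Firestore query, and OpenAI call are left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_reviews(query_vec, reviews, k=3, min_stars=None):
    """Filter reviews on metadata, then rank the rest by similarity to the query."""
    candidates = [
        r for r in reviews
        if min_stars is None or r["stars"] >= min_stars
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query_vec, r["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, retrieved):
    """Assemble the text that would be sent to the language model."""
    context = "\n".join(f"- {r['text']}" for r in retrieved)
    return (
        "Answer using only these reviews:\n"
        f"{context}\n\n"
        f"Question: {question}"
    )
```

Filtering on metadata before ranking keeps the similarity search limited to reviews that can actually appear in the answer, which matters when the full set of embeddings is large.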
Summary
An API that generates a standard JSON Schema from sample data, a website schema, or another source, saving people the time of structuring it by hand for mapping, Google BigQuery, reference documentation, or other schema-dependent use cases.
What I did
Created the Google Cloud Function API endpoint in Python, importing either the Vertex AI library or the ChatGPT (OpenAI) library.
Created a JSON payload schema for sending AI instructions and configurations.
Prompt-engineered the instructions for synthesizing the source data into a JSON Schema format.
Created a demo app using Google Sheets to construct the JSON instructions, send the API request, and output the JSON Schema response.
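As a rough illustration of the request flow above, the sketch below shows one possible payload shape, how the instructions and source could be combined into a prompt, and how a model reply could be parsed back into a schema. Every field name (`provider`, `instructions`, `source_type`, `source`) and the prompt wording are hypothetical and are not the actual payload schema. The model call itself is omitted.

```python
import json

FENCE = "`" * 3  # Markdown code fence that models often wrap JSON in

# Hypothetical request payload; field names are assumptions for this sketch.
EXAMPLE_PAYLOAD = {
    "provider": "vertexai",
    "instructions": "Generate a JSON Schema describing the source data.",
    "source_type": "sample_data",
    "source": [{"id": 1, "name": "Ada", "email": "ada@example.com"}],
}


def build_prompt(payload):
    """Combine the instructions and source data into a single prompt string."""
    source_text = json.dumps(payload["source"], indent=2)
    return (
        f"{payload['instructions']}\n"
        f"Source type: {payload['source_type']}\n"
        f"Source:\n{source_text}\n"
        "Respond with the JSON Schema only."
    )


def parse_schema_response(text):
    """Parse a model reply into a dict, tolerating a surrounding code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        cleaned = "\n".join(lines)
    schema = json.loads(cleaned)
    if not isinstance(schema, dict):
        raise ValueError("expected the schema to be a JSON object")
    return schema
```

Parsing the reply before returning it means a malformed model response surfaces as an error in the API instead of reaching the Google Sheets demo as broken JSON.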