Search Engine is a personal, educational project: a functional, work-in-progress search engine. Its purpose is to familiarise myself with more technology and to demonstrate an understanding of fundamental software engineering concepts.
The intention is to mimic a complete Agile process: build the monolithic application, including tests, and then simulate a "scale up" scenario by splitting the monolith into microservices.
Initially, the project employs a monolithic architecture. The components are built using the proxy design pattern, which allows them to remain decoupled while still communicating. This makes it possible to extract components into separate services when the system needs to scale, simplifying the transition to a microservice architecture.
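As a sketch of that idea (all names here are illustrative, not the project's actual API), a proxy satisfies the same interface as the in-process component it wraps, so callers are unaffected when the backend later becomes a remote service:

```go
package main

import "fmt"

// Graph is the abstraction shared by callers and implementations.
type Graph interface {
	AddLink(url string) error
}

// inMemoryGraph is a concrete, in-process implementation.
type inMemoryGraph struct {
	links map[string]struct{}
}

func newInMemoryGraph() *inMemoryGraph {
	return &inMemoryGraph{links: make(map[string]struct{})}
}

func (g *inMemoryGraph) AddLink(url string) error {
	g.links[url] = struct{}{}
	return nil
}

// graphProxy satisfies the same interface but could forward calls over
// the network once the component is extracted into its own service.
type graphProxy struct {
	backend Graph // today: in-process; later: an HTTP/gRPC client
}

func (p *graphProxy) AddLink(url string) error {
	// A real proxy might add retries, auth, or serialisation here.
	return p.backend.AddLink(url)
}

func main() {
	var g Graph = &graphProxy{backend: newInMemoryGraph()}
	fmt.Println(g.AddLink("https://example.com")) // <nil>: no error
}
```

Because both sides depend only on `Graph`, swapping the backend for a network client is a one-line change at the construction site.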
The project adheres to SOLID principles. [3]
- Single Responsibility Principle: "A class should have one, and only one, reason to change."
- Open-Closed Principle: "You should be able to extend a class's behaviour without modifying it."
- Liskov Substitution Principle: "Derived classes must be substitutable for their base classes."
- Interface Segregation Principle: "Make fine-grained interfaces that are client-specific."
- Dependency Inversion Principle: "Depend on abstractions, not on concretions."
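In Go, the Dependency Inversion Principle usually means depending on a small, consumer-defined interface. This sketch (names are illustrative, not from the project) shows a crawler that never references a concrete index, so backends can be swapped without touching the crawler:

```go
package main

import "fmt"

// Indexer is the abstraction the high-level component depends on (DIP);
// it is also deliberately small and client-specific (ISP).
type Indexer interface {
	Index(doc string) error
}

// Crawler depends only on the Indexer abstraction, never on a concrete
// store, so swapping search backends requires no crawler changes.
type Crawler struct {
	idx Indexer
}

func (c *Crawler) Process(doc string) error { return c.idx.Index(doc) }

// stdoutIndexer is one concrete implementation used here for the demo.
type stdoutIndexer struct{}

func (stdoutIndexer) Index(doc string) error {
	fmt.Println("indexed:", doc)
	return nil
}

func main() {
	c := &Crawler{idx: stdoutIndexer{}}
	c.Process("hello world") // prints "indexed: hello world"
}
```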
The project adopts an Agile approach: build components in small iterations to have a working prototype as fast as possible. User Stories were created to gather requirements and indicate the set of high-level components needed; these were then turned into cards and organised on a Kanban board to aid development.
The stories answer two questions: what do we need to develop, and how completely would it meet our requirements? User Stories use the template from Atlassian. [2]
As an... end-user, I need to be able to... submit new links, So as to... update the link graph and make their contents searchable.
Acceptance criteria:
- A frontend or API endpoint for end-users to submit links.
- Submitted links must be added to the graph and must be crawled by the system and indexed.
- Already submitted links should be accepted by the backend but not inserted twice into the graph.
As an... end-user, I need to be able to... submit full-text search queries, So as to... retrieve a list of relevant results from the indexed content.
Acceptance criteria:
- A frontend or API endpoint for end-users to submit a full-text query.
- Paginate results when a query returns multiple matches.
- Each entry in the result list must contain the following items: title or link description, the link to the content, and a timestamp indicating when the link was last crawled.
- When the query results in no matches, an appropriate response must be returned to the user.
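The pagination criterion above can be sketched as a small helper (a minimal, hypothetical illustration, not the project's real query API), with an empty page standing in for the "no matches" response:

```go
package main

import "fmt"

// paginate returns one page of a result list; page is zero-based.
// An out-of-range page yields nil, which the API layer would turn
// into an appropriate "no results" response.
func paginate(results []string, page, perPage int) []string {
	start := page * perPage
	if start < 0 || start >= len(results) {
		return nil
	}
	end := start + perPage
	if end > len(results) {
		end = len(results)
	}
	return results[start:end]
}

func main() {
	hits := []string{"a", "b", "c", "d", "e"}
	fmt.Println(paginate(hits, 0, 2)) // [a b]
	fmt.Println(paginate(hits, 2, 2)) // [e]
	fmt.Println(paginate(hits, 9, 2)) // [] (no matches for that page)
}
```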
As a... crawler backend system, I need to be able to... obtain a list of sanitised links, So as to... fetch and index their contents while also expanding the link graph with newly discovered links.
Acceptance criteria:
- The crawler can query the link graph and receive a list of stale links that need to be crawled.
- Links received by the crawler are retrieved from their remote hosts unless the remote server provides an ETag or Last-Modified header that the crawler has already seen.
- Retrieved content is scanned for links, and the link graph gets updated.
- Retrieved content is indexed and added to the search corpus.
As a... PageRank calculator backend system, I need to be able to... access the link graph, So as to... calculate and persist the PageRank score for each link.
Acceptance criteria:
- The PageRank calculator can obtain an immutable snapshot of the entire link graph.
- A PageRank score is assigned to every link in the graph.
- The search corpus entries are annotated with the updated PageRank scores.
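For intuition, the calculation itself is a power iteration over the graph's adjacency structure. This toy version (my own illustration; the real calculator operates on a persisted graph snapshot) runs a fixed number of iterations with the standard damping factor:

```go
package main

import "fmt"

// pageRank runs iters power-iteration steps over an adjacency list
// (node -> outgoing links) of n nodes, with damping factor d.
func pageRank(out map[int][]int, n, iters int, d float64) []float64 {
	rank := make([]float64, n)
	for i := range rank {
		rank[i] = 1.0 / float64(n)
	}
	for it := 0; it < iters; it++ {
		next := make([]float64, n)
		for i := range next {
			next[i] = (1 - d) / float64(n) // random-jump contribution
		}
		for node, links := range out {
			if len(links) == 0 {
				// Dangling node: spread its rank evenly over all nodes.
				for i := range next {
					next[i] += d * rank[node] / float64(n)
				}
				continue
			}
			share := d * rank[node] / float64(len(links))
			for _, dst := range links {
				next[dst] += share
			}
		}
		rank = next
	}
	return rank
}

func main() {
	// 0 -> 1 -> 2 -> 0: a simple cycle, so ranks stay at 1/3 each.
	out := map[int][]int{0: {1}, 1: {2}, 2: {0}}
	fmt.Printf("%.3f\n", pageRank(out, 3, 20, 0.85))
}
```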
As a... Member of the engineering team, I need to be able to... monitor the health of the application and services, So as to... detect and address application issues.
Acceptance criteria:
- All services should periodically submit health- and performance-related metrics to a centralised metrics collection system.
- A monitoring dashboard is created for each service.
- A high-level monitoring dashboard tracks the overall system health.
- Metric-based alerts are defined and linked to a paging service.
| Metric | Expectation | Measurement Period | Notes |
|---|---|---|---|
| Website availability | 99% uptime | Yearly | Tolerates up to 3d 15h 39m of downtime per year |
| Index service availability | 99.9% uptime | Yearly | Tolerates up to 8h 45m of downtime per year |
| PageRank service availability | 70% uptime | Yearly | Not a user-facing component of our system; the service can endure longer periods of downtime |
| Search response time | 30% of requests answered within 0.5s | Monthly | |
| Search response time | 70% of requests answered within 1.2s | Monthly | |
| Search response time | 99% of requests answered within 2.0s | Monthly | |
| CPU utilisation for PageRank service | 90% average utilisation | Weekly | Should not be paying for idle computing nodes |
UML component diagram for the search engine. [1]
As this is a search engine that catalogues links and the connections between them, a graph-based model is a natural choice for the link store.
ER diagram for the link graph component. [1]
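The diagram's two entities map naturally onto Go structs. The field names below are assumptions for illustration, not the project's actual schema:

```go
package main

import (
	"fmt"
	"time"
)

// Link is a vertex in the graph: a URL the engine knows about.
type Link struct {
	ID          int64
	URL         string
	RetrievedAt time.Time // last successful crawl; zero value = never crawled
}

// Edge is a directed connection between two links: Src links to Dst.
type Edge struct {
	ID        int64
	Src, Dst  int64 // Link IDs
	UpdatedAt time.Time
}

func main() {
	l := Link{ID: 1, URL: "https://example.com"}
	e := Edge{ID: 1, Src: 1, Dst: 2}
	fmt.Println(l.URL, e.Src, "->", e.Dst)
}
```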
The link graph is first implemented as an in-memory store, which keeps the component self-contained and lets its tests run without spinning up additional database instances for testing or demonstration.
Additionally, a database-backed graph implementation uses CockroachDB as the primary persistence store.
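A minimal in-memory store might look like the following (a sketch with illustrative names; the real store implements a richer interface shared with the CockroachDB implementation, so the same test suite can exercise both):

```go
package main

import (
	"fmt"
	"sync"
)

// memoryStore is a concurrency-safe, map-backed link store.
type memoryStore struct {
	mu    sync.RWMutex
	links map[string]struct{}
}

func newMemoryStore() *memoryStore {
	return &memoryStore{links: make(map[string]struct{})}
}

// UpsertLink adds a link; re-adding an existing URL is a no-op, which
// keeps the operation idempotent for resubmitted links.
func (s *memoryStore) UpsertLink(url string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.links[url] = struct{}{}
}

// Len reports how many distinct links the store holds.
func (s *memoryStore) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.links)
}

func main() {
	s := newMemoryStore()
	s.UpsertLink("https://example.com")
	s.UpsertLink("https://example.com")
	fmt.Println(s.Len()) // 1: duplicate submissions collapse
}
```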
What is CockroachDB? Here is the official description as provided by the CockroachDB docs [4]:
CockroachDB is a distributed SQL database built on a transactional and strongly consistent key-value store. It scales horizontally; survives disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention; supports strongly-consistent ACID transactions; and provides a familiar SQL API for structuring, manipulating, and querying data.
This means the database scales horizontally with very little overhead: capacity is added simply by adding nodes, and clusters automatically rebalance data onto nodes with spare capacity.
Every transaction in CockroachDB guarantees ACID semantics spanning arbitrary tables and rows, even when data is distributed.
Taken from Wikipedia:
ACID (atomicity, consistency, isolation, durability) is a set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps. In the context of databases, a sequence of database operations that satisfies the ACID properties (which can be perceived as a single logical operation on the data) is called a transaction. For example, a transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction. [5]
We also get PostgreSQL-like query syntax, and the benefit of using a pure-Go Postgres driver to connect to the database.
Migrations use the golang-migrate tool [6], which can apply or roll back database schema changes from the command line.
The CockroachDB migrations live in `linkgraph/store/cockroachdb/migrations`.
To run them, export an environment variable containing the connection string; the migrate command can then be run manually or via the Makefile:
```shell
dan@Sol:~/search-engine$ export CDB_MIGRATE='cockroachdb://root@localhost:26257/linkgraph?sslmode=disable'
dan@Sol:~/search-engine$ make db-migrations-up
1/u create_links_table (40.725ms)
2/u create_edges_table (168.5279ms)
dan@Sol:~/search-engine$ make db-migrations-down
Are you sure you want to apply all down migrations? [y/N]
Applying all down migrations
2/d create_edges_table (161.7728ms)
1/d create_links_table (275.748ms)
```

All tests can be run by using the Makefile command:
```shell
dan@Sol:~/github.com/kyteproject/search-engine$ make test
[go test] running tests and collecting coverage metrics
=== RUN   Test
OK: 10 passed
--- PASS: Test (7.38s)
PASS
coverage: 83.1% of statements
ok      search-engine/linkgraph/store/cockroachdb       7.398s  coverage: 83.1% of statements
=== RUN   Test
OK: 10 passed
--- PASS: Test (0.31s)
PASS
coverage: 100.0% of statements
ok      search-engine/linkgraph/store/memory    0.327s  coverage: 100.0% of statements
```

This project and information is sourced primarily from the book Hands-On Software Engineering with Golang by Achilleas Anagnostopoulos. All copyrighted material referenced in this write-up is for educational use and should be taken as my personal notes. No infringement is intended.
1. Hands-On Software Engineering with Golang - Achilleas Anagnostopoulos [Packt Publishing]
2. https://www.atlassian.com/agile/project-management/user-stories
3. https://team-coder.com/solid-principles/
4. https://www.cockroachlabs.com/docs/v20.2/frequently-asked-questions.html#what-is-cockroachdb
5. https://en.wikipedia.org/wiki/ACID
6. https://github.com/golang-migrate/migrate