---
title: What I've learned developing an ad server
url: /what-i-ve-learned-developing-ad-server.html
date: 2017-04-17T12:00:00+02:00
type: post
draft: false
---

For the past year and a half I have been developing a native advertising server
that contextually matches ads and displays them in different template forms on
a variety of websites. The project grew from serving thousands of ads per day
to millions.

The system is made up of a couple of core components:

- API for serving ads,
- Utils - cronjobs and queue management tools,
- Dashboard UI.

The initial release used [MongoDB](https://www.mongodb.com/) for full-text
search, but it was later replaced by [Elasticsearch](https://www.elastic.co/)
for better CPU utilization and better search performance. The switch gave us
access to many of Elasticsearch's amazing features. You should check it out if
you do any search-related operations.

Because the premise of the server is to provide a native ad experience, ads are
rendered on the client side via a simple templating engine. This ensures that
ads can be displayed in a number of different ways based on the visual style of
the page. It also makes the JavaScript client library quite complex.

So now that you know the basics about the product, let's get into the lessons
we learned.

## Aggregate everything

After the beta version was released, everything (impressions, clicks, etc.) was
written to the database at nanosecond resolution. At that time we were using
[PostgreSQL](https://www.postgresql.org/), and the database quickly grew well
above 200 GB of disk space. That was problematic: statistics took a
disturbingly long time to aggregate, and indexes on the stats table were no
help once we passed 500 million datapoints.

> There is marketing product information and there is real-life experience.
And they tend to be quite the opposite.

This is why everything is now aggregated on a daily basis and fed to Elastic in
the form of a daily summary. With this approach we can track many more
dimensions, such as zone, channel, and platform information, and use that
information to adapt the placement of ads in specific spots more precisely.

We have also adopted [Redis](https://redis.io/) as a first-class citizen in our
stack. Because Redis also persists its data to local disk, we have some sort of
backup if the server were to suffer a failure.

All real-time statistics for ad serving and redirecting are kept as counters in
a Redis instance, then extracted daily and pushed to Elastic.
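
Here is a minimal sketch of that pattern, assuming the `redis` and
`elasticsearch` Python clients (2017-era versions); the key and index names are
illustrative, not the ones we actually use:

```python
import datetime

import redis
from elasticsearch import Elasticsearch

r = redis.StrictRedis(host="localhost", port=6379, db=0)
es = Elasticsearch(["localhost:9200"])

def record_impression(ad_id):
    # hot path: a single cheap atomic increment per impression
    r.incr("stats:impressions:%s" % ad_id)

def flush_daily_stats():
    # cron path: read each counter once a day, push a summary
    # document to Elastic, and reset the counter to zero
    day = datetime.date.today().isoformat()
    for key in r.scan_iter("stats:impressions:*"):
        ad_id = key.decode().rsplit(":", 1)[-1]
        count = int(r.getset(key, 0))
        es.index(index="daily-stats", doc_type="summary", body={
            "date": day,
            "ad_id": ad_id,
            "impressions": count,
        })
```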

## Measure everything

The thing about software is that you really don't know how well it performs
under load until that load actually arrives. When testing locally everything is
fine, but in production things tend to fall apart.

As a solution we measure everything we can: function execution time (by
wrapping functions with timers), server performance (CPU, memory, disk, etc.),
and Nginx and [uWSGI](https://uwsgi-docs.readthedocs.io/) performance. We
sacrifice a bit of performance for the sake of this information, and we store
all of it for later analysis.
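
The timer wrapping can be as small as a decorator along these lines (a minimal
sketch; the bookkeeping structure is illustrative and `match_by_context` is
just a placeholder). It accumulates exactly the kind of numbers shown in the
example below:

```python
import time
from collections import defaultdict
from functools import wraps

# accumulated per-function statistics, keyed by function name
TIMINGS = defaultdict(lambda: {"counter": 0, "avg": 0.0, "elapsed": 0.0})

def timed(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            stats = TIMINGS[func.__name__]
            stats["counter"] += 1
            stats["elapsed"] += time.time() - start
            stats["avg"] = stats["elapsed"] / stats["counter"]
    return wrapper

@timed
def match_by_context(request):
    pass  # placeholder for the real matching logic
```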

**Example of function execution time**

```json
{
    "get_final_filtered_ads": {
        "counter": 1931250,
        "avg": 0.0066143431,
        "elapsed": 12773.9500310003
    },
    "store_keywords_statistics": {
        "counter": 1931011,
        "avg": 0.0004605267,
        "elapsed": 889.2821669996
    },
    "match_by_context": {
        "counter": 1931011,
        "avg": 0.0055960716,
        "elapsed": 10806.0758889999
    },
    "match_by_high_performance": {
        "counter": 262,
        "avg": 0.0152770229,
        "elapsed": 4.00258
    },
    "store_impression_stats": {
        "counter": 1931250,
        "avg": 0.0006189991,
        "elapsed": 1195.4419869999
    }
}
```

We have also started profiling with [cProfile](https://pymotw.com/2/profile/)
and then visualizing the results with
[KCachegrind](http://kcachegrind.sourceforge.net/). This provides a much more
detailed look into code execution.
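
One possible workflow (a sketch: `serve_ad` is a hypothetical entry point, and
the conversion step assumes the `pyprof2calltree` package):

```python
import cProfile

# dump profiling data for a representative request to a file
cProfile.run("serve_ad(request)", "adserver.prof")

# then convert and open it in KCachegrind from the shell:
#   pyprof2calltree -i adserver.prof -o callgrind.adserver
#   kcachegrind callgrind.adserver
```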

## Cache control is your friend

Because we use a JavaScript library for rendering ads, we rely on this script
extensively, and when needed we must be able to change its behavior quickly.

In our case we cannot simply replace the JavaScript URL in the HTML code. It
usually takes a day or two for the people who maintain the sites to change the
code or add a ?ver=xxx parameter, which makes rapid deployment and testing very
difficult and time-consuming. There is a limit to how much you can test
locally.

We are now in the process of integrating [Google Tag
Manager](https://www.google.com/analytics/tag-manager/), but a couple of the
websites are built on the ASP.NET platform, which has some problems with the
tag manager. With the solution below we can be certain that we are serving the
latest version of the script.

It only takes one mistake: if users end up with the script cached for a year,
you probably know where the problem is.

```nginx
# nginx ➜ /etc/nginx/sites-available/default
location /static/ {
    alias /path-to-static-content/;
    autoindex off;
    charset utf-8;
    gzip on;
    gzip_types text/plain application/javascript application/x-javascript text/javascript text/xml text/css;

    # long-lived cache for assets that never change in place
    location ~* \.(ico|gif|jpeg|jpg|png|woff|ttf|otf|svg|woff2|eot)$ {
        expires 1y;
        add_header Pragma public;
        add_header Cache-Control "public";
    }

    # short-lived cache for scripts and styles we may need to update quickly
    location ~* \.(css|js|txt)$ {
        expires 3600s;
        add_header Pragma public;
        add_header Cache-Control "public, must-revalidate";
    }
}
```

Also be careful when redirecting to a URL in your Python code. We noticed that
if we didn't precisely set the cache control and expires headers on the
response, browsers cached the redirect, the request never reached the server,
and we therefore couldn't measure clicks. So when redirecting, do as follows
and there will be no problems.

```python
# python ➜ bottlepy web micro-framework
response = bottle.HTTPResponse(status=302)
# forbid caching so every click actually reaches the server
response.set_header("Cache-Control", "no-store, no-cache, must-revalidate")
response.set_header("Expires", "Thu, 01 Jan 1970 00:00:00 GMT")
response.set_header("Location", url)
return response
```

> Cache control in browsers is quite aggressive and you need to be precise to
avoid future problems. We learned that lesson the hard way.

## Learn NGINX

When deciding on a web server we went with Nginx as a reverse proxy for our
applications. We adopted a microservice-oriented architecture early in the
project to ensure that, as we scale, we can easily add servers to our cluster.
Nginx was crucial for load balancing and static content delivery.
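
The load-balancing side can be as small as an upstream block like this (a
sketch with illustrative names and addresses, not our actual topology):

```nginx
# two API instances behind one reverse proxy
upstream adserver_api {
    least_conn;             # send each request to the least busy instance
    server 10.0.0.11:8000;
    server 10.0.0.12:8000;
}

server {
    listen 80;

    location /api/ {
        proxy_pass http://adserver_api;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```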

At first our config file was quite simple, but it later grew larger. After
endless patching and adding of new settings, I sat down and learned more about
the guts of Nginx. This proved very useful and we were able to squeeze much
more out of our setup. So I advise you to take your time and read through the
[documentation](https://nginx.org/en/docs/). It saved us a lot of headaches;
googling for solutions only goes so far.

## Use Redis/Memcached

As explained above, we use caching for basically everything. It is the
cornerstone of our services. At first we were very careful about the quantity
of things we stored in [Redis](https://redis.io/), but we later found out that
the memory footprint is very low even when storing large amounts of data.

So we gradually increased our usage, up to caching whole HTML outputs of the
dashboard. This improved our performance by an order of magnitude, and Redis's
native TTL support goes hand in hand with our needs.
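
The pattern is simple enough to show in a few lines (a sketch assuming the
`redis` Python client; `render_dashboard` and the key name are hypothetical):

```python
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

def get_dashboard_html(user_id):
    key = "cache:dashboard:%s" % user_id
    html = r.get(key)
    if html is None:
        # cache miss: do the expensive render once...
        html = render_dashboard(user_id)
        # ...then let Redis expire it automatically via the native TTL
        r.setex(key, 300, html)
    return html
```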

The reason we chose [Redis](https://redis.io/) over
[Memcached](https://memcached.org/) was Redis's out-of-the-box scalability, but
all of this can be achieved with Memcached as well.

## Conclusion

There are many more details that could have been written up, and every single
topic here deserves its own post, but you probably get the idea of the problems
we faced.