I am looking for a system to keep track of multiple AWS accounts: 130+ accounts, each containing 200+ servers.
I want to know how to keep track of machine failures, service failures, etc.
I also want to know how to automatically bring up a replacement machine if the underlying hardware fails or the machine is terminated while running as a spot instance.
I'm open to all solutions, including Chef/Terraform automation, healing scripts, etc.
You guys will be saving me a lot of sleepless nights :)
Thanks in advance!!
Manage multiple aws accounts
506 Views Asked by Shardool Singh At
There are 2 best solutions below
Tim
AWS Organizations is useful for central management. You can also look at a multi-account billing strategy and security strategy. A shared services account that holds your IAM users will make things easier.
Regarding tracking failures, you can set up automatic instance recovery using CloudWatch alarms. CloudWatch alarms can also email you when something unexpected happens, though setting them up individually could be time-consuming. At your scale I think you should look into third-party tools.
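As a rough sketch of what central management can look like in practice, the snippet below enumerates the active member accounts of an organization so that other tooling can loop over them. It assumes a boto3 Organizations client created in the management account. The helper names `active_accounts` and `role_arn_for` are made up for illustration, and `OrganizationAccountAccessRole` is only present by default in accounts created through Organizations, so treat that role name as an assumption for your setup.

```python
from typing import Any, Dict, Iterator


def active_accounts(org_client: Any) -> Iterator[Dict[str, str]]:
    """Yield the id and name of every ACTIVE account in the organization.

    `org_client` is expected to behave like boto3.client("organizations").
    """
    paginator = org_client.get_paginator("list_accounts")
    for page in paginator.paginate():
        for account in page["Accounts"]:
            if account["Status"] == "ACTIVE":
                yield {"id": account["Id"], "name": account["Name"]}


def role_arn_for(account_id: str,
                 role_name: str = "OrganizationAccountAccessRole") -> str:
    """Build the ARN of the cross-account role to assume in a member account.

    The default role name is an assumption; use whatever role your
    accounts actually expose to the management account.
    """
    return f"arn:aws:iam::{account_id}:role/{role_name}"


# Possible usage (needs boto3 and credentials for the management account):
#   import boto3
#   for acct in active_accounts(boto3.client("organizations")):
#       print(acct["name"], role_arn_for(acct["id"]))
```

Each ARN can then be passed to STS `assume_role` to get temporary credentials for that account, which keeps one script working across all 130+ accounts.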
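To give an idea of what automatic instance recovery involves, here is a minimal sketch that builds the parameters for a CloudWatch alarm on the `StatusCheckFailed_System` metric with the EC2 recover action attached. The function name, alarm name pattern, period and evaluation count are illustrative assumptions, not recommended values. Recovery only applies to system (hardware) status check failures on supported instance types, and it does not bring back a terminated spot instance.

```python
from typing import Any, Dict


def recovery_alarm_params(instance_id: str, region: str) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires when the system status check fails for two
    consecutive one-minute periods and triggers the EC2 recover action.
    """
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in action ARN that asks EC2 to recover the instance.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


# Possible usage (needs boto3 and credentials for the target account):
#   import boto3
#   cw = boto3.client("cloudwatch", region_name="us-east-1")
#   cw.put_metric_alarm(**recovery_alarm_params("i-0123456789abcdef0", "us-east-1"))
```

Generating the parameters from a function like this makes it possible to create the alarms in a loop over instances instead of setting them up one by one; an SNS topic ARN could be appended to `AlarmActions` if email notification is also wanted.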
This is purely my take on your problem statement.
1) For managing and keeping track of multiple AWS accounts, you can use AWS Organizations. It lets you centrally manage all the other 130+ accounts from one management (root) account. You can enable consolidated billing as well.
2) As far as keeping track of failures goes, you may need to customize this according to your requirements. For example, you can build a microservice on top of Docker containers or ECS whose sole purpose is to keep track of failures, generate a report, and push it to S3 on a daily basis. You can then create a dashboard in AWS QuickSight from these reports in S3. There can be another microservice that rectifies the failures. It just depends on how exhaustive and fine-grained you want your implementation to be.
3) Spawning replacement instances when spot instances are terminated can be achieved through simple Auto Scaling configurations. Here are some articles you may want to go through for ideas:
Using Spot Instances with On-Demand instances
Optimizing Spot Fleet+Docker with High Availability
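As one possible illustration of point 3, the sketch below builds the parameters for an Auto Scaling group that mixes On-Demand and Spot capacity, so that the group launches a replacement whenever a spot instance is reclaimed. The function name, the launch template name, subnet ids, instance types and capacity numbers are all placeholders chosen for the example; it assumes a launch template already exists.

```python
from typing import Any, Dict, List


def mixed_asg_params(name: str,
                     launch_template_name: str,
                     subnet_ids: List[str],
                     instance_types: List[str],
                     desired: int,
                     on_demand_base: int = 1) -> Dict[str, Any]:
    """Build keyword arguments for Auto Scaling create_auto_scaling_group.

    `on_demand_base` instances are always On-Demand; everything above
    that baseline is requested as Spot.
    """
    return {
        "AutoScalingGroupName": name,
        "MinSize": desired,
        "MaxSize": desired * 2,  # arbitrary headroom, for illustration only
        "DesiredCapacity": desired,
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template_name,
                    "Version": "$Latest",
                },
                # Several instance types widen the pool of Spot capacity.
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                "OnDemandBaseCapacity": on_demand_base,
                "OnDemandPercentageAboveBaseCapacity": 0,
                "SpotAllocationStrategy": "capacity-optimized",
            },
        },
    }


# Possible usage (needs boto3 and credentials for the target account):
#   import boto3
#   asg = boto3.client("autoscaling", region_name="us-east-1")
#   asg.create_auto_scaling_group(**mixed_asg_params(
#       "web-fleet", "web-template", ["subnet-aaa", "subnet-bbb"],
#       ["m5.large", "m5a.large"], desired=4))
```

Because the group keeps its desired capacity, a terminated spot instance is replaced without any custom healing script; the On-Demand baseline keeps part of the fleet running even when no Spot capacity is available.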